Abstract

In binary-transaction data-mining, traditional frequent itemset mining of- 
ten produces results which are not straightforward to interpret. To overcome 
this problem, probability models are often used to produce more compact and 
conclusive results, albeit with some loss of accuracy. Bayesian statistics have 
been widely used in the development of probability models in machine learn- 
ing in recent years and these methods have many advantages, including their 
abilities to avoid overfitting. In this paper, we develop two Bayesian mixture 
models with the Dirichlet distribution prior and the Dirichlet process (DP) prior 
to improve the previous non-Bayesian mixture model developed for transaction 
dataset mining. We implement the inference of both mixture models using two 
methods: a collapsed Gibbs sampling scheme and a variational approximation 
algorithm. Experiments in several benchmark problems have shown that both 
mixture models achieve better performance than a non-Bayesian mixture model. 
The variational algorithm is the faster of the two approaches while the Gibbs 
sampling method achieves a more accurate result. The Dirichlet process mixture 
model can automatically grow to a proper complexity for a better approxima- 
tion. Once the model is built, it can be very fast to query and run analysis 
on (typically 10 times faster than Eclat, as we will show in the experiment sec- 
tion). However, these approaches also show that mixture models underestimate 
the probabilities of frequent itemsets. Consequently, these models have a higher 
sensitivity but a lower specificity. 
1. Introduction



Transaction data sets are binary data sets with rows corresponding to trans- 
actions and columns corresponding to items or attributes. Data mining tech- 
niques for such data sets have been developed for over a decade. Methods for 
finding correlations and regularities in transaction data can have many
commercial and practical applications, including targeted marketing, recommender
systems, more effective product placement, and many others.

Retail records and web site logs are two examples of transaction data sets. 
For example, in a retailing application, the rows of the data correspond to 
purchases made by various customers, and the columns correspond to different 
items for sale in the store. This kind of data is often sparse, i.e., there may be 
thousands of items for sale, but a typical transaction may contain only a handful 
of items, as most of the customers buy only a small fraction of the possible 
merchandise. In this paper we will only consider binary transaction data, but 
transaction data can also contain the numbers of each item purchased (multi- 
nomial data). An important correlation which data mining seeks to elucidate is 
which items co-occur in purchases and which items are mutually exclusive, and 
never (or rarely) co-occur in transactions. This information allows prediction of 
future purchases from past ones. 

Frequent itemset mining and association rule mining [1] are the key
approaches for finding correlations in transaction data. Frequent itemset mining
finds all frequently occurring item combinations along with their frequencies 
in the dataset with a given minimum frequency threshold. Association rule 
mining uses the results of frequent itemset mining to find the dependencies be- 
tween items or sets of items. If we regard the minimum frequency threshold 
as an importance standard, then the set of frequent itemsets contains all the 
"important" information about the correlation of the dataset. The aim of fre- 
quent itemset mining is to extract useful information from the kinds of binary 
datasets which are now ubiquitous in human society. It aims to help people 
realize and understand the various latent correlations hidden in the data and 
to assist people in decision making, policy adjustment and the performance of 
other activities which rely on correct analysis and knowledge of the data. 

However, the results of such mining are difficult to use. The threshold or 
criterion of mining is hard to choose for a compact but representative set of 
itemsets. To prevent the loss of important information, the threshold is often 
set quite low, causing a huge set of itemsets which brings difficulties in interpre- 
tation. These properties of large scale and weak interpretability block a wider 
use of the mining technique and are barriers to a further understanding of the 
data itself. Traditionally, Frequent Itemset Mining (FIM) suffers from three
difficulties. The first is scalability: the data sets are often very large, the
number of frequent itemsets at the chosen support is also large, and the
algorithm may need to be run multiple times to find an appropriate frequency
threshold.
The second difficulty is that the support-confidence framework is often not able 
to provide the information that people really need. Therefore people seek other 
criteria or measurements for more "interesting" results. The third difficulty is 
in interpreting the results or getting some explanation of the data. Therefore
the recent focus of FIM research has been in the following three directions.

1. Looking for more compact but representative forms of the itemsets - in 
other words, mining compressed itemsets. The research in this direction 
consists of two types: lossless compression such as closed itemset mining 
[2] and lossy compression such as maximal itemset mining [3]. In closed
itemset mining, a method is proposed to mine the set of closed itemsets 
which is a subset of the set of frequent itemsets. This can be used to derive 
the whole set of frequent itemsets without loss of information. In maximal 
itemset mining, the support information of the itemsets is ignored and 
only a few longest itemsets are used to represent the whole set of frequent 
itemsets. 

2. Looking for better standards and qualifications for filtering the itemsets
so that the results are more "interesting" to users. Work in this direction
focuses on how to extract information which is both useful and unexpected,
as people want to find a measure that is closest to the ideal of
"interestingness". Several objective and subjective measures have been
proposed, such as lift [3], $\chi^2$ [?] and the work of [?], which uses a
Bayesian network as background knowledge to measure the interestingness of
frequent itemsets.

3. Looking for mathematical models which reveal and describe both the 
structure and the inner-relationship of the data more accurately, clearly 
and thoroughly. There are two ways of using probability models in FIM. 
The first is to build a probability model that can organize and utilize the 
results of mining such as the Maximal Entropy model [7]. The second
is to build a probability model that is directly generated from the data
itself which can not only predict the frequent itemsets, but also explain
the data. An example of such a model is the mixture model.

These three directions influence each other and form the main stream of current 
FIM research. Of the three, the probability model solution considers the data 
as a sampled result from the underlying probability model and tries to explain 
the system in an understandable, structural and quantified way. With a good 
probability model, we can expect the following advantages in comparison with 
normal frequent itemset mining: 

1. The model can reveal correlations and dependencies in the dataset, whilst 
frequent itemsets are merely collections of facts awaiting interpretation. 
A probability model can handle several kinds of probability queries, such 
as joint, marginal and conditional probabilities, whilst frequent itemset 
mining and association rule mining focus only on high marginal and con- 
ditional probabilities. The prediction is made easy with a model. However, 
in order to predict with frequent itemsets, we still need to organize them 
and build a structured model first. 

2. It is easier to observe interesting dependencies between the items, both 
positive and negative, from the model's parameters than it is to discrimi- 
nate interesting itemsets or rules from the whole set of frequent itemsets or 
association rules. In fact, the parameters of the probability model trained
from a dataset can be seen as a collection of features of the original data. 
Normally, the size of a probability model is far smaller than the set of 
frequent itemsets. Therefore the parameters of the model are highly rep- 
resentative. Useful knowledge can be obtained by simply "mining" the 
parameters of the model directly. 
3. As the scale of the model is often smaller than the original data, it can 
sometimes serve as a proxy or a replacement for the original data. In real 
world applications, the original dataset may be huge and involve large 
time costs in querying or scanning the dataset. One may also need to 
run multiple queries on the data, e.g. FIM queries with different thresh- 
olds. In such circumstances, if we just want an approximate estimation, 
a better choice is obviously to use the model to make the inference. As 
we will show in this paper, when we want to predict all frequent itemsets, 
generating them from the model is much faster than mining them from 
the original dataset because the cost of model prediction does not depend on
the scale of the data. And because the model is independent of the minimum
frequency threshold, we only need to train the model once and can then
predict at multiple thresholds at a lower time cost.

Several probability models have been proposed to represent the data. Here 
we give a brief review. 

The simplest and most intuitive model is the Independent model. This as- 
sumes that the probability of an item appearing in a transaction is independent 
of all the other items in that transaction. The probabilities of the itemsets are 
products of the probabilities of the corresponding items. This model is obvi- 
ously too simple to describe the correlation and association between items, but 
it is the starting point and base line of many more effective models. 

The Multivariate Tree Distribution model [8], also called the Chow-Liu Tree,
assumes that there are only pairwise dependencies between the variables, and
that the dependency graph on the attributes has a tree structure. There are
three steps in building the model: computing the pairwise marginals of the
attributes, computing the mutual information between the attributes, and
applying Kruskal's algorithm [9] to find the maximum-weight spanning tree of
the full graph, whose nodes are the attributes and whose edge weights are the
mutual information between them. Given the tree, the marginal probability of
an itemset can first be decomposed into a product of factors via the chain
rule and then calculated with the standard belief propagation algorithm [10].

The Maximal Entropy model tries to find a distribution that maximizes the
entropy within the constraints of frequent itemsets [11, 3] or other statistics [12].



The algorithm for solving the Maximal Entropy model is the Iterative Scaling 
algorithm. The Iterative Scaling algorithm is a process of finding the probability 
of a given itemset query. The algorithm starts from an "ignorant" initial state
and iteratively updates the parameters to satisfy the related constraints
until convergence. Finally the probability of the given query can be
calculated via the parameters.

The Bernoulli Mixture model [11], [13] is based on the assumption that there
are latent or unobserved types controlling the distribution of the items. Within 
each type, the items are independent. In other words, the items are conditionally 
independent given the type. This assumption is a natural extension of the 
Independent model. The Bernoulli Mixture model is a widely used model for 
statistical and machine learning tasks. The idea is to use an additive mixture of 
simple distributions to approximate a more complex distribution. This model 
is the focus of this paper. 

When applying a mixture model to data, one needs to tune the model to the 
data. There are two ways to do this. In a Maximum-Likelihood Mixture Model,
which in our paper we will call the non-Bayesian Mixture Model, the probability 
is characterised by a set of parameters. These are set by optimizing them to 
maximize the likelihood of the data. Alternatives are Bayesian Mixture models. 
In these, the parameters are treated as random variables which themselves need 
to be described via probability distributions. Our work is focused on elucidat- 
ing the benefits of Bayesian mixtures over non-Bayesian mixtures for frequent 
itemset mining. 

Compared with non-Bayesian machine learning methods, Bayesian approaches 
have several valuable advantages. Firstly, Bayesian integration does not suffer
from over-fitting, because it does not fit parameters directly to the data; it
integrates over all parameters, weighted by how well they fit the data.
Secondly, prior knowledge can be incorporated naturally and all uncertainty is
manipulated in a consistent manner. One of the most prominent recent
developments in this field is the application of the Dirichlet process (DP)
mixture model, a nonparametric Bayesian technique for mixture modelling, which
allows for the automatic determination of an appropriate number of mixture 
components. Here, the term "nonparametric" means the number of mixture 
components can grow automatically to the necessary scale. The DP is an in- 
finite extension of the Dirichlet distribution which is the prior distribution for 
finite Bayesian mixture models. Therefore the DP mixture model can contain 
as many components as necessary to describe an unknown distribution. By us- 
ing a model with an unbounded complexity, under-fitting is mitigated, whilst 
the Bayesian approach of computing or approximating the full posterior over 
parameters mitigates over-fitting. 

The difficulty of such Bayesian approaches is that finding the right model for
the data is often computationally intractable. A standard methodology for the DP
mixture model is Markov chain Monte Carlo (MCMC) sampling. However, the
MCMC approach can be slow to converge and its convergence can be difficult
to diagnose. An alternative is the variational inference method developed in
recent years [15]. In this paper, we develop both finite and infinite Bayesian
Bernoulli mixture models for transaction data sets with both MCMC sampling
and variational inference and use them to generate frequent itemsets. We
perform experiments to compare the performance of the Bayesian mixture models
and the non-Bayesian mixture model. Experimental results show that the Bayesian
mixture models can achieve better precision. The DP mixture model can find
a proper number of mixture components automatically.

In this paper, we extend the non-Bayesian mixture model to a Bayesian
mixture model. The assumptions and the structure of the Bayesian model are
presented, and the corresponding algorithms for inference via MCMC sampling
and variational approximation are described. For the sampling approach,
we implement the Gibbs sampling algorithm [16] for the finite Bayesian mixture
model (GSFBM), which is a multivariate Markov chain Monte Carlo (MCMC)
sampling [17, 18, 19] scheme. For the variational approximation, we implement
the variational EM algorithm for the finite Bayesian mixture model (VFBM)
by approximating the true posterior with a factorized distribution function. We
also extend the finite Bayesian mixture model to the infinite case. The Dirichlet
process prior is introduced to the model so that the model can fit a proper
complexity by itself. This model solves the problem of finding the
proper number of components used in traditional probability models. For this
model, we also implement two algorithms. The first one is Gibbs sampling for
the Dirichlet Process mixture model (GSDPM). The second one is the truncated
variational EM algorithm for the Dirichlet Process mixture model (VDPM).
The word "truncated" means we approximate the model with a finite number
of components.

The rest of the paper is organized as follows. In the next section, we
define the problem, briefly review the development of FIM and introduce
the notation used in this paper. In section 3, we introduce the non-Bayesian
Bernoulli mixture model and its inference by the EM algorithm. In sections 4
and 5, we develop the Bayesian mixture models, including how to do inference via
Gibbs sampling and variational EM and how to use the models for predictive
inference. Then, in section 6, we use 4 benchmark transaction data sets to test
the models, and compare their performance with the non-Bayesian mixture model.
We also compare the MCMC approach and the EM approach by their
accuracy and time cost. Finally, we conclude this paper with a discussion of
future work.



2. Problem and Notations 

Let $\mathcal{I} = \{i_1, i_2, \ldots, i_D\}$ be the set of items, where $D$ is the number of items.
A set $I = \{i_{m_1}, i_{m_2}, \ldots, i_{m_k}\} \subseteq \mathcal{I}$ is called an itemset with length $k$, or a $k$-itemset.

A transaction data set $T$ over $\mathcal{I}$ is a collection of $N$ transactions: $X^\mu \in T$,
$\mu = 1 \ldots N$. A transaction $X^\mu$ is a $D$-dimensional vector $(x_1^\mu, \ldots, x_i^\mu, \ldots, x_D^\mu)$
where $x_i^\mu \in \{0, 1\}$. A transaction is said to support an itemset $I$ if and only
if $\forall i_m \in I$, $x_m^\mu = 1$. A transaction can also be written as an itemset. Then $X^\mu$
supports $I$ if $I \subseteq X^\mu$. The frequency of an itemset is:

$$f(I) = \frac{|\{\mu \mid I \subseteq X^\mu,\ X^\mu \in T\}|}{N}$$

An itemset is frequent if its frequency meets a given minimum frequency
threshold $f_{\min}$. The aim of frequent itemset mining is to discover all the
frequent itemsets along with their frequencies.

From a probabilistic view, the data set $T$ could be regarded as a sampling
result from an unknown distribution. Our aim is to find or approximate the 
probabilistic distribution which generated the data, and use this to predict all 
the frequent itemsets. Inference is the task of restricting the possible probability 
models from the data. In the Bayesian approach, this usually means putting 
a probability over unknown parameters. In the non-Bayesian approach, this 
usually means finding the best or most-likely parameters. 
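
The frequency definition above can be illustrated with a short Python sketch. The function names and the toy data set are hypothetical and chosen only for illustration; transactions are represented as lists of 0/1 values and items by 0-based column indices.

```python
from typing import Iterable, Sequence


def itemset_frequency(transactions: Sequence[Sequence[int]],
                      itemset: Iterable[int]) -> float:
    """Fraction of transactions that contain every item of `itemset`.

    A transaction supports the itemset when all listed columns equal 1.
    """
    items = list(itemset)
    if not transactions:
        return 0.0
    supporting = sum(1 for x in transactions if all(x[i] == 1 for i in items))
    return supporting / len(transactions)


def is_frequent(transactions: Sequence[Sequence[int]],
                itemset: Iterable[int], f_min: float) -> bool:
    """An itemset is frequent when its frequency reaches the threshold f_min."""
    return itemset_frequency(transactions, itemset) >= f_min


# Toy data set (illustrative only): N = 4 transactions over D = 3 items.
T = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
]
print(itemset_frequency(T, [0, 1]))
```

Each call scans all $N$ transactions, which is the cost that the model-based approach described later tries to avoid.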

3. Bernoulli Mixtures 

In this section, we describe the non-Bayesian mixture model. Consider a 
grocery store where the transactions are purchases of the items the store sells. 
The simplest model would treat each item as independent, so the probability 
of a sale containing item A and item B is just the product of the two prob- 
abilities separately. However, this would fail to model non-trivial correlations 
between the items. A more complex model assumes a mixture of independent 
models. The model assumes the buyers of the store can be characterized into 
different types representing different consumer preferences. Within each type, 
the probabilities are independent. In other words, the items are conditionally 
independent, when conditioned on the component, or type, which generated the 
given transaction. However, although we observe the transaction, we do not
observe the type. Thus, we must employ the machinery of inference to deal with 
this. 

Suppose there are $K$ components or types. Then each transaction is
generated by one of the $K$ components following a multinomial distribution with
parameter $\pi = (\pi_1, \ldots, \pi_K)$, where $\sum_{k=1}^{K} \pi_k = 1$. Here we introduce a
component indicator $Z = \{z^\mu\}_{\mu=1}^{N}$ indicating which components the transactions are
generated from: $z^\mu = k$ if $X^\mu$ is generated from the $k$th component. According
to the model assumption, once the component is selected, the probabilities of
the items are independent of each other. That is, for transaction $X^\mu$:

$$p(X^\mu \mid z^\mu, \Theta) = \prod_{i=1}^{D} p(x_i^\mu \mid z^\mu, \Theta) \tag{1}$$

where $\Theta$ represents all the parameters of the model. Thus, the probability of
a transaction given by the mixture model is:

$$p(X^\mu \mid \Theta) = \sum_{k=1}^{K} \pi_k \prod_{i=1}^{D} p(x_i^\mu \mid z^\mu = k, \Theta) \tag{2}$$

Since the transactions are binary vectors, we assume the conditional probability
of each item follows a Bernoulli distribution with parameter $\phi_{ik}$:

$$p(x_i^\mu \mid z^\mu = k, \Theta) = \phi_{ik}^{x_i^\mu} (1 - \phi_{ik})^{1 - x_i^\mu} \tag{3}$$

A graphic representation of this model is shown in Figure 1, where circles denote
random variables, arrows denote dependencies, and the box (or plate) denotes
replication over all data points. In Figure 1, the distribution of each transaction
$X^\mu$ depends on the selection of $z^\mu$ and the model parameter $\phi$, and $z^\mu$ depends on
$\pi$. This process is repeated $N$ times to generate the whole data set.

Figure 1: non-Bayesian mixture graphic representation

Algorithm 1 EM algorithm for Bernoulli Mixtures

initialize $\pi_k$ and $\phi_{ik}$
repeat
    for $\mu = 1$ to $N$ do
        for $k = 1$ to $K$ do
            $\tau_k^\mu = \dfrac{\pi_k \prod_{i=1}^{D} \phi_{ik}^{x_i^\mu} (1 - \phi_{ik})^{1 - x_i^\mu}}{\sum_{k'=1}^{K} \pi_{k'} \prod_{i=1}^{D} \phi_{ik'}^{x_i^\mu} (1 - \phi_{ik'})^{1 - x_i^\mu}}$
        end for
    end for
    $\pi_k = \dfrac{1}{N} \sum_{\mu=1}^{N} \tau_k^\mu$
    $\phi_{ik} = \dfrac{\sum_{\mu=1}^{N} \tau_k^\mu x_i^\mu}{\sum_{\mu=1}^{N} \tau_k^\mu}$
until convergence

In this model, we need to estimate $\pi_k$ and $\phi_{ik}$ from the data. If we knew
which component generated each transaction this would be easy. For example,
we could estimate $\phi_{ik}$ as the frequency at which item $i$ occurs in component $k$,
and $\pi_k$ would be the frequency at which component $k$ occurs in the data.
Unfortunately, we do not know which component generated each transaction; it is an
unobserved variable. The EM algorithm [20] is often used for parameter
estimation in models with hidden variables in general, and for mixture
models in particular. We describe this in more detail in Appendix 1. For a
detailed explanation, see section 9.3.3 of [21]. The EM algorithm is given in
Algorithm 1.
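
The EM procedure for Bernoulli mixtures can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the experiments: the function name, the random initialization, the fixed number of iterations (instead of a convergence test) and the clamping constant `eps` are all assumptions made for the example. Parameters are stored as `phi[k][i]`, i.e. component first.

```python
import math
import random


def em_bernoulli_mixture(data, K, n_iter=50, seed=0, eps=1e-9):
    """Fit a K-component Bernoulli mixture to binary rows by EM.

    Returns (pi, phi, log_liks) where log_liks[t] is the data
    log-likelihood evaluated at the start of iteration t.
    """
    rng = random.Random(seed)
    N, D = len(data), len(data[0])
    pi = [1.0 / K] * K
    # Random start away from 0 and 1 so that components differ.
    phi = [[rng.uniform(0.25, 0.75) for _ in range(D)] for _ in range(K)]
    log_liks = []
    for _ in range(n_iter):
        # E-step: responsibilities tau[mu][k], computed in log space.
        tau, log_lik = [], 0.0
        for x in data:
            logs = []
            for k in range(K):
                s = math.log(max(pi[k], eps))
                for i in range(D):
                    p = min(max(phi[k][i], eps), 1.0 - eps)
                    s += math.log(p) if x[i] else math.log(1.0 - p)
                logs.append(s)
            m = max(logs)
            weights = [math.exp(v - m) for v in logs]
            total = sum(weights)
            tau.append([w / total for w in weights])
            log_lik += m + math.log(total)
        log_liks.append(log_lik)
        # M-step: re-estimate mixing proportions and Bernoulli parameters.
        for k in range(K):
            Nk = sum(t[k] for t in tau)
            pi[k] = Nk / N
            for i in range(D):
                num = sum(t[k] * x[i] for t, x in zip(tau, data))
                phi[k][i] = num / max(Nk, eps)
    return pi, phi, log_liks


# Toy usage (illustrative data only).
toy = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 1],
       [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 0]]
pi_hat, phi_hat, trace = em_bernoulli_mixture(toy, K=2, n_iter=30, seed=1)
```

Working in log space avoids numerical underflow when $D$ is large, which matters for sparse transaction data with thousands of items.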

Another problem of this algorithm is the selection of $K$. The choice of $K$
greatly influences the quality of the result. If $K$ is too small, the model
cannot provide sufficiently accurate results. Conversely, if $K$ is too large,
it may cause over-fitting. There is no single procedure for finding the
correct $K$. People often try several increasing values of $K$ and choose one by
comparing the quality of the results, guarding against over-fitting with
cross-validation or some other criterion such as the Bayesian Information
Criterion [22].

Predicting frequent itemsets with this model is quite straightforward. For any
itemset $I$, its probability is calculated by taking into account only the
items occurring in $I$ and ignoring (i.e. marginalizing over) the items which are
not in $I$:

$$p(I \mid \Theta) = \sum_{k=1}^{K} \pi_k \prod_{i_m \in I} \phi_{mk} \tag{4}$$

The number of free parameters used for prediction is $K(D + 1) - 1$.
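
Equation (4) translates directly into code. The sketch below is illustrative: the function name and the parameter values are made up for the example, and `phi[k][m]` holds the Bernoulli parameter of item `m` in component `k`.

```python
def itemset_probability(pi, phi, itemset):
    """Probability of an itemset under a Bernoulli mixture (Equation 4).

    Items outside the itemset are marginalized out, so only the
    parameters of the listed items enter the product.
    """
    total = 0.0
    for k in range(len(pi)):
        term = pi[k]
        for m in itemset:
            term *= phi[k][m]
        total += term
    return total


# Hypothetical two-component model over three items.
pi = [0.5, 0.5]
phi = [[0.9, 0.8, 0.1],
       [0.2, 0.1, 0.7]]
p_pair = itemset_probability(pi, phi, [0, 1])
p_independent = itemset_probability(pi, phi, [0]) * itemset_probability(pi, phi, [1])
```

With these values the mixture assigns the pair a higher probability than the product of the two single-item probabilities, which is the kind of positive correlation the Independent model cannot represent.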

The last issue is how to generate the full set of frequent itemsets. In frequent
itemset mining algorithms, obtaining the frequencies of the itemsets from the
data set is always a time consuming problem. Most algorithms, such as
Apriori [23], require multiple scans of the data set, or use extra memory for
maintaining special data structures such as tid-lists for Eclat [24] and the
FP-tree for FP-growth [25]. In the Bernoulli mixture model approach, with a
prepared model, both time and memory cost can be greatly reduced at the price of
some accuracy, since the frequency counting process is replaced by a simple
calculation of sums and products. To find the frequent itemsets using
any of the probability models in this paper, we simply mine the probability model
instead of the data. To do this, one can use any frequent itemset mining
algorithm; we use Eclat. However, instead of measuring the frequency of the
itemsets, we calculate their probabilities from the probability model.

Typically this results in a great improvement in the complexity of
determining itemset frequency. For a given candidate itemset, to check the
exact frequency of the itemset, we need to scan the original dataset for Apriori,
or check the cached data structure in memory for Eclat. In both algorithms,
the time complexity is $O(N)$, where $N$ is the number of transactions in the
dataset. However, the calculation in the mixture model needs only $KL$
multiplications and $K$ additions, where $L$ is the length of the itemset. Normally,
$KL$ is much smaller than $N$.

The exact search strategy with the Bernoulli mixture model is similar to Eclat or
Apriori, based on the Apriori principle [23]: all sub-itemsets of a frequent
itemset are frequent, and all super-itemsets of an infrequent itemset are
infrequent. Following this principle, the search space can be significantly
reduced. In our research we use the Eclat lattice decomposition framework to
organize the search process. We do not discuss this framework in detail in this
paper. A more specific explanation is given by [24].
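
The idea of mining the model instead of the data can be sketched as below. This is a simple depth-first enumeration with Apriori-style pruning, not the Eclat lattice decomposition used in the paper; the function name and the example parameters are hypothetical. Pruning is valid because every $\phi_{mk} \le 1$, so extending an itemset can never increase its model probability.

```python
def mine_frequent_itemsets(pi, phi, f_min):
    """Enumerate all itemsets whose model probability is at least f_min.

    pi[k] is the weight of component k and phi[k][m] the Bernoulli
    parameter of item m in component k. Returns {itemset tuple: probability}.
    """
    K, D = len(pi), len(phi[0])
    results = {}

    def extend(prefix, comp_probs, start):
        # comp_probs[k] = pi[k] * product of phi[k][m] over items in prefix,
        # so each extension costs only K multiplications and K additions.
        for m in range(start, D):
            new_probs = [comp_probs[k] * phi[k][m] for k in range(K)]
            p = sum(new_probs)
            if p >= f_min:
                itemset = prefix + (m,)
                results[itemset] = p
                extend(itemset, new_probs, m + 1)
            # Otherwise prune: no super-itemset of an infrequent itemset
            # can be frequent.

    extend((), list(pi), 0)
    return results


# Hypothetical two-component model over three items.
pi = [0.5, 0.5]
phi = [[0.9, 0.8, 0.1],
       [0.2, 0.1, 0.7]]
frequent = mine_frequent_itemsets(pi, phi, f_min=0.3)
```

Because the model does not depend on the threshold, the same fitted parameters can be reused with several values of `f_min` without returning to the data.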



4. The Finite Bayesian Mixtures 

4.1. Definition of the model

For easier model comparison, we use the same notation in non-Bayesian 
model, finite Bayesian model and the later infinite Bayesian model when this 
causes no ambiguity. The difference between Bayesian mixture models and 
non-Bayesian mixture models is that Bayesian mixtures try to form a smooth 
distribution over the model parameters by introducing appropriate priors. The 
original mixture model introduced in the previous section is a two-layer model. The
top layer is the multinomial distribution for choosing the mixtures, and the next
layer is the Bernoulli distribution for items. In the Bayesian mixture we introduce a
Dirichlet distribution [14] as the prior of the multinomial parameter $\pi$ and Beta
distributions as the priors of the Bernoulli parameters $\{\phi_{ik}\}$. The new model
assumes that the data was generated as follows.

1. Assign $\alpha$, $\beta$ and $\gamma$ as the hyperparameters of the model, where $\alpha$, $\beta$ and
$\gamma$ are all positive scalars. These are chosen a priori.

2. Choose $\pi \sim \mathrm{Dir}(\alpha)$ where

$$p(\pi \mid \alpha) = \frac{\Gamma(\alpha)}{\Gamma(\alpha/K)^{K}} \prod_{k=1}^{K} \pi_k^{\alpha/K - 1} \tag{5}$$

with $\sum_{k=1}^{K} \pi_k = 1$; $\sim$ denotes sampling, and Dir is the Dirichlet
distribution.

3. For each item and component choose $\phi_{ik} \sim \mathrm{Beta}(\beta, \gamma)$ where

$$p(\phi_{ik} \mid \beta, \gamma) = \frac{\Gamma(\beta + \gamma)}{\Gamma(\beta)\,\Gamma(\gamma)} \phi_{ik}^{\beta - 1} (1 - \phi_{ik})^{\gamma - 1} \tag{6}$$

with $\phi_{ik} \in [0, 1]$ where $i \in \{1, \ldots, D\}$, $k \in \{1, \ldots, K\}$ and Beta denotes
the Beta distribution.

4. For each transaction $X^\mu$:

(a) Choose a component $z^\mu \sim \mathrm{Multinomial}(\pi)$, where

$$p(z^\mu = k \mid \pi) = \pi_k \tag{7}$$

(b) Then we can generate data by:

$$p(X^\mu \mid z^\mu, \phi) = \prod_{i=1}^{D} \phi_{i z^\mu}^{x_i^\mu} (1 - \phi_{i z^\mu})^{1 - x_i^\mu} \tag{8}$$

Figure 2 is a graphic representation for Bayesian mixtures.

Figure 2: finite Bayesian mixture graphic representation

This process can be briefly written as:

$$
\begin{aligned}
\pi \mid \alpha &\sim \mathrm{Dir}(\alpha/K, \alpha/K, \ldots, \alpha/K) \\
\phi_{ik} \mid \beta, \gamma &\sim \mathrm{Beta}(\beta, \gamma) \\
z^\mu \mid \pi &\sim \mathrm{Multi}(\pi)
\end{aligned}
\tag{9}
$$
In other words, the assumption is that the data was generated by first drawing
the parameters $\pi$ and $\phi$ (steps 2 and 3), then repeating the two sub-steps
of step 4 $N$ times to generate the data. Since important variables of the model
are not known, namely $\pi$, $\phi$ and $Z$, Bayesian principles say that we should
compute distributions over these, and then integrate them out to get quantities of
interest. However, this is not tractable. Therefore, we implement two common
approximation schemes: Gibbs sampling and variational Bayes.
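
The generative process of the finite Bayesian mixture can be simulated with the Python standard library. The sketch below is illustrative only: the function names are hypothetical, and the symmetric Dirichlet draw is obtained by normalizing independent Gamma variates, which is a standard construction rather than something specified in this paper.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from Dir(alphas) by normalizing independent Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_transactions(N, D, K, alpha, beta, gamma, seed=0):
    """Simulate N binary transactions over D items from the finite Bayesian mixture."""
    rng = random.Random(seed)
    # Step 2: mixing proportions from a symmetric Dirichlet Dir(alpha/K, ...).
    pi = sample_dirichlet([alpha / K] * K, rng)
    # Step 3: one Beta(beta, gamma) draw per (component, item) pair.
    phi = [[rng.betavariate(beta, gamma) for _ in range(D)] for _ in range(K)]
    Z, T = [], []
    for _ in range(N):
        # Step 4(a): choose the component of this transaction.
        z = rng.choices(range(K), weights=pi)[0]
        # Step 4(b): items are independent Bernoulli draws given the component.
        x = [1 if rng.random() < phi[z][i] else 0 for i in range(D)]
        Z.append(z)
        T.append(x)
    return pi, phi, Z, T


pi, phi, Z, T = generate_transactions(N=20, D=5, K=3,
                                      alpha=1.5, beta=1.0, gamma=1.0, seed=42)
```

Simulated data of this kind is useful for checking an inference routine, since the true assignments `Z` and parameters are known.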

4.2. Finite Bayesian mixtures via Gibbs sampling

One approach for Bayesian inference is to approximate probabilistic integrals
by sums over a finite number of samples drawn from the target probability
distribution. Gibbs sampling is an example of the Markov chain Monte Carlo method,
which is a method of sampling from a probability distribution. Gibbs sampling
works by sampling one variable at a time, conditioned on the current values of the
others. We will use a collapsed Gibbs sampler, which means
we will not use sampling to estimate all parameters. We will use sampling to
infer the components which generated each data point and integrate out the
other parameters.

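We first introduce the inference of the model via Gibbs sampling. Similar
to the non-Bayesian mixture model, we need to work on the distribution of the
component indicator $Z$. According to the model, the joint distribution of $Z$ is:

$$
\begin{aligned}
p(Z) &= \int_{\pi} p(Z \mid \pi)\, p(\pi)\, d\pi \\
&= \frac{\Gamma(\alpha)}{\Gamma(\alpha/K)^{K}} \int_{\pi} \prod_{k=1}^{K} \pi_k^{\alpha/K - 1} \prod_{\mu=1}^{N} \pi_k^{I(z^\mu = k)}\, d\pi \\
&= \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} \prod_{k=1}^{K} \frac{\Gamma(N_k + \alpha/K)}{\Gamma(\alpha/K)}
\end{aligned}
\tag{10}
$$

where $N_k$ is the number of points assigned to the $k$th component, the integral over $\pi$
means the integral over a $(K - 1)$-dimensional simplex and the indicator function
$I(z^\mu = k)$ means:

$$I(z^\mu = k) = \begin{cases} 1, & \text{if } z^\mu = k \\ 0, & \text{if } z^\mu \neq k \end{cases}$$

The conditional probability of the $\mu$th assignment given the other assignments
is:

$$p(z^\mu = k \mid Z^{/\mu}) = \frac{N_{k/\{\mu\}} + \alpha/K}{N - 1 + \alpha} \tag{11}$$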
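where $N_{k/\{\mu\}}$ is the number of points assigned to the $k$th component except the $\mu$th
point. The posterior distribution of the Bernoulli parameter $\phi_k$ is the following
if we know the component assignment:

\begin{align}
p(\phi_k \mid Z, T) &\propto p(T \mid \phi_k, Z)\, p(\phi_k \mid \beta, \gamma) \tag{12} \\
&\propto \prod_{i=1}^{D} \mathrm{Beta}(\phi_{ik} \mid \beta_{ik}, \gamma_{ik}) \tag{13}
\end{align}

where

$$\beta_{ik} = \beta + \sum_{\mu=1,\, z^\mu = k}^{N} x_i^\mu, \qquad \gamma_{ik} = \gamma + N_k - \sum_{\mu=1,\, z^\mu = k}^{N} x_i^\mu$$

Combining Equations (11) and (13), we can calculate the posterior probability
of the $\mu$th assignment by integrating out $\phi$:

$$
\begin{aligned}
p(z^\mu = k \mid Z^{/\mu}, T) &\propto \int_{\phi_k} p(z^\mu = k \mid Z^{/\mu})\, p(X^\mu \mid \phi_k)\, p(\phi_k \mid Z^{/\mu}, T^{/\mu})\, d\phi_k \\
&\propto \frac{N_{k/\{\mu\}} + \alpha/K}{N - 1 + \alpha} \prod_{i=1}^{D} \left( \frac{\beta_{ik/\{\mu\}}}{\beta + \gamma + N_{k/\{\mu\}}} \right)^{x_i^\mu} \left( \frac{\gamma_{ik/\{\mu\}}}{\beta + \gamma + N_{k/\{\mu\}}} \right)^{1 - x_i^\mu}
\end{aligned}
\tag{14}
$$

where $N_{k/\{\mu\}}$, $\beta_{ik/\{\mu\}}$ and $\gamma_{ik/\{\mu\}}$ are calculated excluding the $\mu$th point,
$T^{/\mu}$ is the data set without the $\mu$th transaction, and the
integral over $\phi_k$ means the integral over a $D$-dimensional vector $\phi_k \in [0, 1]^D$.
Equation (14) shows how to sample the component indicator based on the other
assignments of the transactions. The whole process of the collapsed Gibbs sampling
for the finite Bayesian mixture model is shown in Algorithm 2. Initialization of
the parameters $\alpha$, $\beta$ and $\gamma$ is discussed in section 6.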

The predictive inference after Gibbs sampling is quite straightforward. We
can estimate the proportion and the conditional probability parameters from the
sampling results. The proportion is inferred from the component indicator $Z$
we sampled:

$$\pi_k = \frac{N_k + \alpha/K}{N + \alpha} \tag{15}$$

The conditional Bernoulli parameters are estimated as follows:

$$\phi_{ik} = \frac{\beta_{ik}}{\beta + \gamma + N_k} \tag{16}$$

For a given itemset $I$, its predictive probability is:

$$p(I \mid \pi, \phi) = \sum_{k=1}^{K} \pi_k \prod_{i_m \in I} \phi_{mk} \tag{17}$$

In practice, the parameters $\pi_k$ and $\phi_{ik}$ need to be calculated only once for
prediction. The model contains $K \times (D + 1) - 1$ free parameters.
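
Equations (15) and (16) can be computed from a sampled assignment with a few lines of Python. The sketch below is illustrative; the function name and the tiny data set are hypothetical, and `Z` is assumed to be a single sample of component assignments with 0-based component labels.

```python
def posterior_estimates(data, Z, K, alpha, beta, gamma):
    """Estimate pi_k (Equation 15) and phi_ik (Equation 16) from assignments Z."""
    N, D = len(data), len(data[0])
    Nk = [0] * K                        # number of transactions per component
    ones = [[0] * D for _ in range(K)]  # ones[k][i]: count of x_i = 1 in component k
    for z, x in zip(Z, data):
        Nk[z] += 1
        for i in range(D):
            ones[z][i] += x[i]
    pi = [(Nk[k] + alpha / K) / (N + alpha) for k in range(K)]
    # beta_ik = beta + ones[k][i]; the denominator is beta + gamma + N_k.
    phi = [[(beta + ones[k][i]) / (beta + gamma + Nk[k]) for i in range(D)]
           for k in range(K)]
    return pi, phi


# Tiny illustrative example: three transactions, two items, two components.
data = [[1, 0], [1, 1], [0, 1]]
Z = [0, 0, 1]
pi, phi = posterior_estimates(data, Z, K=2, alpha=2.0, beta=1.0, gamma=1.0)
```

The hyperparameters act as pseudo-counts, so no estimate is exactly 0 or 1 even when an item never occurs in a component.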
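
Algorithm 2 collapsed Gibbs sampling for finite Bayesian mixture model

input parameters $\alpha$, $\beta$, $\gamma$
input parameter $K$ as the number of components
initialize $Z$ to be a random assignment
repeat
    for $\mu = 1$ to $N$ do
        For all $i$, $k$ update $\beta_{ik/\{\mu\}}$, $\gamma_{ik/\{\mu\}}$ from the current assignments, excluding the $\mu$th point
        For all $k$ calculate multinomial probabilities based on
            $p(z^\mu = k) = \dfrac{N_{k/\{\mu\}} + \alpha/K}{\alpha + N - 1} \prod_{i=1}^{D} \left( \dfrac{\beta_{ik/\{\mu\}}}{\beta + \gamma + N_{k/\{\mu\}}} \right)^{x_i^\mu} \left( \dfrac{\gamma_{ik/\{\mu\}}}{\beta + \gamma + N_{k/\{\mu\}}} \right)^{1 - x_i^\mu}$
        Normalize $p(z^\mu = k)$ over $k$
        Sample $z^\mu$ based on $p(z^\mu)$
    end for
until convergence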
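
Algorithm 2 can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function name is hypothetical, a fixed number of sweeps replaces the convergence test, and the weights are multiplied directly rather than in log space, which is adequate only for small $D$.

```python
import random


def collapsed_gibbs(data, K, alpha, beta, gamma, n_sweeps=50, seed=0):
    """Collapsed Gibbs sampling for the finite Bayesian Bernoulli mixture.

    Returns the final assignments Z together with the sufficient statistics
    Nk[k] (points in component k) and ones[k][i] (count of x_i = 1 in k).
    """
    rng = random.Random(seed)
    N, D = len(data), len(data[0])
    Z = [rng.randrange(K) for _ in range(N)]
    Nk = [0] * K
    ones = [[0] * D for _ in range(K)]
    for z, x in zip(Z, data):
        Nk[z] += 1
        for i in range(D):
            ones[z][i] += x[i]
    for _ in range(n_sweeps):
        for mu in range(N):
            x, old = data[mu], Z[mu]
            # Remove point mu so that all counts exclude it.
            Nk[old] -= 1
            for i in range(D):
                ones[old][i] -= x[i]
            weights = []
            for k in range(K):
                w = (Nk[k] + alpha / K) / (N - 1 + alpha)
                denom = beta + gamma + Nk[k]
                for i in range(D):
                    b = beta + ones[k][i]            # beta_ik without mu
                    g = gamma + Nk[k] - ones[k][i]   # gamma_ik without mu
                    w *= (b if x[i] else g) / denom
                weights.append(w)
            # random.choices normalizes the weights internally.
            new = rng.choices(range(K), weights=weights)[0]
            Z[mu] = new
            Nk[new] += 1
            for i in range(D):
                ones[new][i] += x[i]
    return Z, Nk, ones


toy = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 1],
       [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 0]]
Z, Nk, ones = collapsed_gibbs(toy, K=2, alpha=1.0, beta=1.0, gamma=1.0,
                              n_sweeps=20, seed=3)
```

Keeping the counts `Nk` and `ones` up to date incrementally means each reassignment costs $O(KD)$ instead of a full pass over the data.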



4.3. Finite Bayesian Mixture Model via Variational Inference

In this section we describe the variational EM algorithm [20, 21] for this
model. Based on the model assumption, the joint probability of the transaction
$X^\mu$, the component indicator $z^\mu$ and the model parameters $\pi$ and $\phi$ is:

$$p(X^\mu, z^\mu, \pi, \phi \mid \alpha, \beta, \gamma) = p(X^\mu \mid z^\mu, \phi)\, p(z^\mu \mid \pi)\, p(\phi \mid \beta, \gamma)\, p(\pi \mid \alpha) \tag{18}$$

For the whole data set:

$$p(T, Z, \pi, \phi \mid \alpha, \beta, \gamma) = \prod_{\mu=1}^{N} \left[ p(X^\mu \mid z^\mu, \phi)\, p(z^\mu \mid \pi) \right] p(\phi \mid \beta, \gamma)\, p(\pi \mid \alpha) \tag{19}$$

Integrating over $\pi$ and $\phi$, summing over $Z$ and taking the logarithm, we obtain the
log-likelihood of the data set:

$$\ln p(T \mid \alpha, \beta, \gamma) = \ln \int_{\pi} \int_{\phi} \sum_{Z} p(T, Z, \pi, \phi \mid \alpha, \beta, \gamma)\, d\phi\, d\pi \tag{20}$$

Here the integral over $\pi$ means the integral over a $(K - 1)$-dimensional simplex. The
integral over $\phi$ means the integral over a $K \times D$ vector $\phi \in [0, 1]^{K \times D}$. The sum
over $Z$ runs over all possible configurations of $Z$. This integral is intractable
because of the coupling of $Z$ and $\pi$. We therefore approximate the true
posterior with a simpler distribution. This approximate distribution is chosen so
that the variables are decoupled, and the approximate distribution is as close
as possible to the true distribution. In other words, the task is to find the
decoupled distribution most like the true distribution, and use the approximate
distribution to do inference.
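We assume the distribution has the following form:

$$q(Z, \pi, \phi \mid \tau, \rho, \eta, \nu) = \left[ \prod_{\mu=1}^{N} q(z^\mu \mid \tau^\mu) \right] \left[ \prod_{i=1}^{D} \prod_{k=1}^{K} q(\phi_{ik} \mid \eta_{ik}, \nu_{ik}) \right] q(\pi \mid \rho) \tag{21}$$

where

$$q(z^\mu \mid \tau^\mu) \sim \mathrm{Multinomial}(\tau^\mu)$$
$$q(\phi_{ik} \mid \eta_{ik}, \nu_{ik}) \sim \mathrm{Beta}(\eta_{ik}, \nu_{ik})$$
$$q(\pi \mid \rho) \sim \mathrm{Dir}(\rho)$$

Here $\rho$, $\eta$ and $\nu$ are free variational parameters corresponding to the
hyperparameters $\alpha$, $\beta$ and $\gamma$, and $\tau$ is the multinomial parameter for decoupling $\pi$ and
$Z$. We use this $q(\cdot)$ function to approximate the true posterior distribution of
the parameters. To achieve this, we need to estimate the values of $\rho$, $\eta$ and $\nu$.
As in the non-Bayesian mixture EM, we expand the log-likelihood and optimize
its lower bound. The optimization process is similar to the calculations
in the non-Bayesian EM part. In the optimization, we use the fact that
$E[\log \pi_k] = \Psi(\alpha_k) - \Psi(\sum_{k'=1}^{K} \alpha_{k'})$ if $\pi \sim \mathrm{Dir}(\alpha)$, where $\Psi(\cdot)$ is the digamma
function; the analogous identities for $\phi_{ik} \sim \mathrm{Beta}(\eta_{ik}, \nu_{ik})$ are
$E[\log \phi_{ik}] = \Psi(\eta_{ik}) - \Psi(\eta_{ik} + \nu_{ik})$ and $E[\log(1 - \phi_{ik})] = \Psi(\nu_{ik}) - \Psi(\eta_{ik} + \nu_{ik})$. This yields:

$$\rho_k = \alpha + \sum_{\mu=1}^{N} \tau_k^\mu \tag{22}$$

$$\eta_{ik} = \beta + \sum_{\mu=1}^{N} \tau_k^\mu x_i^\mu \tag{23}$$

$$\nu_{ik} = \gamma + \sum_{\mu=1}^{N} \tau_k^\mu (1 - x_i^\mu) \tag{24}$$

$$\tau_k^\mu \propto \exp\Big\{ \Psi(\rho_k) - \Psi\Big(\sum_{k'=1}^{K} \rho_{k'}\Big) + \sum_{i=1}^{D} \big[ x_i^\mu \big(\Psi(\eta_{ik}) - \Psi(\eta_{ik} + \nu_{ik})\big) + (1 - x_i^\mu) \big(\Psi(\nu_{ik}) - \Psi(\eta_{ik} + \nu_{ik})\big) \big] \Big\} \tag{25}$$

Equations (22) to (25) form an iterated optimization procedure. A brief
demonstration of this procedure is given by Algorithm 3.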

For any itemset $I$, its predictive probability given by the model is:

$$p(I \mid \mathcal{T}) = \int_{\pi} \int_{\phi} \sum_{z} p(I \mid z, \phi)\, p(z \mid \pi)\, q(\pi, \phi \mid \rho, \eta, \nu)\, d\pi\, d\phi = \sum_{k=1}^{K} \frac{\rho_k}{\sum_{k'=1}^{K} \rho_{k'}} \prod_{i_m \in I} \frac{\eta_{i_m k}}{\eta_{i_m k} + \nu_{i_m k}} \tag{26}$$

In Equation (26), we use the decoupled $q(\cdot)$ to replace the true posterior
distribution so that the integral is solvable. Equation (26) shows that for
predictive inference we only need the values of $\rho_k$, $\eta_{ik}$ and $\nu_{ik}$
up to proportionality. Therefore the number of parameters is exactly the same as in the
non-Bayesian model.
Algorithm 3 Variational EM for Finite Bayesian Bernoulli Mixtures
input parameters $\alpha$, $\beta$ and $\gamma$
input parameter $K$ as the number of components
initialize $\tau$ to be a random assignment
repeat
    For all $i, k$ update $\rho_k$, $\eta_{ik}$, $\nu_{ik}$ by
        $\rho_k = \alpha + \sum_{\mu=1}^{N} \tau_k^\mu$
        $\eta_{ik} = \beta + \sum_{\mu=1}^{N} \tau_k^\mu x_i^\mu$
        $\nu_{ik} = \gamma + \sum_{\mu=1}^{N} \tau_k^\mu (1 - x_i^\mu)$
    for $\mu = 1$ to $N$ do
        for $k = 1$ to $K$ do
            Update $\tau_k^\mu$ according to (25)
        end for
        Normalize $\tau^\mu$ over $k$
    end for
until convergence
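The two alternating steps of Algorithm 3 can be sketched in plain Python as below. This is an illustrative sketch rather than the authors' Matlab code: the function names are hypothetical, $\beta$ and $\gamma$ are scalars, `eta` and `nu` are indexed `[i][k]`, and the digamma function is approximated by a recurrence plus the standard asymptotic series, since the standard library does not provide it.

```python
import math


def digamma(x):
    """Approximate digamma for x > 0 (recurrence, then asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    return (result + math.log(x) - 0.5 * inv
            - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))


def update_parameters(X, tau, alpha, beta, gamma):
    """Parameter updates for rho_k, eta_ik and nu_ik given tau."""
    N, D, K = len(X), len(X[0]), len(tau[0])
    rho = [alpha + sum(tau[m][k] for m in range(N)) for k in range(K)]
    eta = [[beta + sum(tau[m][k] * X[m][i] for m in range(N))
            for k in range(K)] for i in range(D)]
    nu = [[gamma + sum(tau[m][k] * (1 - X[m][i]) for m in range(N))
           for k in range(K)] for i in range(D)]
    return rho, eta, nu


def update_responsibilities(X, rho, eta, nu):
    """Responsibility update for tau, normalized over k for each transaction."""
    D, K = len(eta), len(rho)
    psi_rho_sum = digamma(sum(rho))
    tau = []
    for x in X:
        log_w = []
        for k in range(K):
            s = digamma(rho[k]) - psi_rho_sum
            for i in range(D):
                norm = digamma(eta[i][k] + nu[i][k])
                s += (digamma(eta[i][k]) if x[i] else digamma(nu[i][k])) - norm
            log_w.append(s)
        top = max(log_w)                   # subtract the max for stability
        w = [math.exp(v - top) for v in log_w]
        total = sum(w)
        tau.append([v / total for v in w])
    return tau


X = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
tau = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.3, 0.7]]
for _ in range(10):
    rho, eta, nu = update_parameters(X, tau, 1.0, 1.0, 1.0)
    tau = update_responsibilities(X, rho, eta, nu)
```

Working in log space and subtracting the maximum before exponentiating avoids underflow when $D$ is large.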



5. The Dirichlet Process Mixture Model

The finite Bayesian mixture model is still restricted by the fact that the
number of components $K$ must be chosen in advance. Ferguson [14] proposed
the Dirichlet Process (DP) as the infinite extension of the Dirichlet distribution.
Applying the DP as the prior of the mixture model allows us to have an arbitrary
number of components, growing as necessary during the learning process. In the
finite Bayesian mixture model, the Dirichlet distribution is a prior for choosing
components. Here the components are in fact distributions drawn from a base
distribution $\mathrm{Beta}(\beta, \gamma)$. In the Dirichlet distribution, the number of components is
a fixed number $K$, so each time we draw a distribution, the result is equal to
one of the $K$ distributions drawn from the base distribution, with probabilities
given by the Dirichlet distribution. Now we relax the number of components to be
unlimited and keep the discreteness of the components, which means that each
time we draw a distribution (component), the result is either equal to an existing
distribution or a new draw from the base distribution. This new process is called
the Dirichlet Process [14] and the drawing scheme is the Blackwell-MacQueen
Pólya urn scheme [27]:

$$z^\mu = \begin{cases} k & \text{with prob. } \dfrac{N_k}{N - 1 + \alpha} \\[2mm] K + 1,\ \phi_{K+1} \sim \mathrm{Beta}(\beta, \gamma) & \text{with prob. } \dfrac{\alpha}{N - 1 + \alpha} \end{cases} \tag{27}$$

The previous model should also be rewritten as:

$$\begin{aligned} B \mid \alpha, B_0 &\sim \mathrm{DP}(\alpha, B_0(\beta, \gamma)) \\ \phi^\mu \mid B &\sim B \\ X^\mu \mid \phi^\mu &\sim p(X^\mu \mid \phi^\mu) \end{aligned} \tag{28}$$
5.1. The Dirichlet Process Mixture Model via Gibbs Sampling

Based on the Pólya urn scheme we can allow $K$ to grow. Following this
scheme, every time we draw a distribution, there is a chance that the distribution
comes from the base distribution, therefore adding a new component to the
model. Under this scheme $K$ has the potential to grow to any positive
integer.

Assume that at a certain stage the actual number of components is $K$. Based
on Equation (27):

$$p(z^\mu = k \mid Z_{-\mu}) = \frac{N_k}{N - 1 + \alpha}, \quad \text{if } k \le K$$

Then the probability that the $\mu$th point is in a new component is:

$$p(z^\mu = K + 1 \mid Z_{-\mu}) = 1 - p(z^\mu \le K \mid Z_{-\mu}) = \frac{\alpha}{N - 1 + \alpha}$$

The rest of the posterior probability remains the same, as there is no $K$ involved:

$$p(z^\mu = k \mid Z_{-\mu}, \mathcal{T}) \propto \frac{N_k}{N - 1 + \alpha} \prod_{i=1}^{D} \left(\frac{\beta_{ik}}{\beta + \gamma + N_k}\right)^{x_i^\mu} \left(\frac{\gamma_{ik}}{\beta + \gamma + N_k}\right)^{1 - x_i^\mu} \tag{29}$$

For the new component, $N_{K+1} = 0$, so that $\beta_{i,K+1} = \beta$ and $\gamma_{i,K+1} = \gamma$, and we have

$$p(z^\mu = K + 1 \mid Z_{-\mu}, \mathcal{T}) \propto \frac{\alpha}{N - 1 + \alpha} \prod_{i=1}^{D} \left(\frac{\beta}{\beta + \gamma}\right)^{x_i^\mu} \left(\frac{\gamma}{\beta + \gamma}\right)^{1 - x_i^\mu} \tag{30}$$

Equations (29) and (30) form a collapsed Gibbs sampling scheme. At the
beginning, all data points are assigned to one initial component. Then for each
data point in the data set, the component indicator is sampled according to the
posterior distribution provided by Equations (29) and (30). After the indicator
is sampled, the relevant parameters $N_k$, $\beta_{ik}$ and $\gamma_{ik}$ are updated for the next data
point. The whole process keeps running until some convergence condition is
met. Algorithm 4 describes the method.

The predictive inference is generally the same as the finite version.
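As a rough illustration of the sampling step in this subsection, the following Python sketch computes the unnormalized weights for assigning one transaction either to an existing component or to a new one. The function name and argument layout are hypothetical, $\beta$ and $\gamma$ are scalars, the counts are assumed to exclude the transaction being resampled, and the new-component weight is obtained by setting $N_k = 0$ in the existing-component formula.

```python
def dp_assignment_weights(x, n_k, ones, alpha, beta, gamma, N):
    """Unnormalized weights for components 1..K, plus a new component (last).

    x       -- the binary transaction being resampled
    n_k[k]  -- N_k, excluding the current transaction
    ones[k] -- per-item counts of 1s in component k, excluding x
    N       -- total number of transactions
    """
    weights = []
    for count, ones_k in zip(n_k, ones):
        w = count / (N - 1 + alpha)
        denom = beta + gamma + count
        for x_i, c in zip(x, ones_k):
            # beta_ik = beta + c, gamma_ik = gamma + count - c (assumption)
            w *= ((beta + c) if x_i else (gamma + count - c)) / denom
        weights.append(w)

    # New component: no data yet, so only the Beta(beta, gamma) prior counts.
    w_new = alpha / (N - 1 + alpha)
    for x_i in x:
        w_new *= (beta if x_i else gamma) / (beta + gamma)
    weights.append(w_new)
    return weights


# One existing component holding two transactions [1, 0]; resample a third.
w = dp_assignment_weights([1, 0], [2], [[2, 0]], 1.0, 1.0, 1.0, 3)
```

If the last entry is sampled, $K$ is incremented and the new component starts with this single transaction.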



5.2. DP Mixtures via Variational Inference

Although the Gibbs sampler can provide a very accurate approximation to
the posterior distribution of the component indicators, it needs to update the
relevant parameters for every data point. Thus it is computationally expensive
and not very suitable for large-scale problems. In 1994, Sethuraman developed
the stick-breaking representation [28] of the DP, which captures the DP prior most
explicitly among the available representations. In the stick-breaking representation,
an unknown random distribution is represented as a sum of countably infinite
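Algorithm 4 Collapsed Gibbs sampling for the Dirichlet process mixture model
input parameters $\alpha$, $\beta$, $\gamma$
initialize $K = 1$
repeat
    for $\mu = 1$ to $N$ do
        For all $i, k$ with $1 \le k \le K$, update $\beta_{ik}$, $\gamma_{ik}$
        Calculate multinomial probabilities based on
            $$p(z^\mu = k \mid Z_{-\mu}, \mathcal{T}) \propto \begin{cases} \dfrac{N_k}{\alpha + N - 1} \prod_{i=1}^{D} \left(\dfrac{\beta_{ik}}{\beta + \gamma + N_k}\right)^{x_i^\mu} \left(\dfrac{\gamma_{ik}}{\beta + \gamma + N_k}\right)^{1 - x_i^\mu} & \text{if } k \le K \\[3mm] \dfrac{\alpha}{\alpha + N - 1} \prod_{i=1}^{D} \left(\dfrac{\beta}{\beta + \gamma}\right)^{x_i^\mu} \left(\dfrac{\gamma}{\beta + \gamma}\right)^{1 - x_i^\mu} & \text{if } k = K + 1 \end{cases}$$
        Normalize $p(z^\mu = k)$ over the $K + 1$ candidates
        Sample $z^\mu$ based on $p(z^\mu)$
        if component $K + 1$ selected then
            $K = K + 1$
        end if
    end for
until convergence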



atomic distributions. The stick-breaking representation provides a possible way
of doing inference for DP mixtures by variational methods. A variational
method for DP mixtures has been proposed by [29]. They showed that the
variational method produces results comparable to MCMC sampling algorithms,
including collapsed Gibbs sampling, but is much faster.

In the transaction data setting, the target distribution is the
distribution of the transaction $p(X^\mu)$ and the atomic distributions are the conditional
distributions $p(X^\mu \mid z^\mu)$. Based on the stick-breaking representation, the
Dirichlet process mixture model is the following.

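1. Assign $\alpha$ as the hyperparameter of the Dirichlet process, and $\beta$, $\gamma$ as the
hyperparameters of the base Beta distribution, where they are all positive
scalars.

2. Choose $v_k \sim \mathrm{Beta}(1, \alpha)$, $k = 1, \ldots, \infty$

3. Choose $\phi_{ik} \sim \mathrm{Beta}(\beta, \gamma)$, $i = 1, \ldots, D$; $k = 1, \ldots, \infty$

4. For each transaction $X^\mu$:

(a) Choose a component $z^\mu \sim \mathrm{Multinomial}(\pi(v))$ where

$$\pi_k(v) = v_k \prod_{l=1}^{k-1} (1 - v_l) \tag{31}$$

(b) Then we can generate data by:

$$p(X^\mu \mid z^\mu, \phi) = \prod_{i=1}^{D} \phi_{i z^\mu}^{x_i^\mu} (1 - \phi_{i z^\mu})^{1 - x_i^\mu} \tag{32}$$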
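The stick-breaking weights of Equation (31) can be computed with a short loop. The sketch below is illustrative only; the function name is hypothetical, and the draw $v_k \sim \mathrm{Beta}(1, \alpha)$ uses `random.betavariate` from the Python standard library.

```python
import random


def stick_breaking_weights(v):
    """Return (weights, remaining) for stick fractions v_1..v_K.

    weights[k] = v_k * prod_{l<k} (1 - v_l), as in Equation (31);
    remaining  = prod_{k<=K} (1 - v_k), the mass left for later components.
    """
    weights, remaining = [], 1.0
    for v_k in v:
        weights.append(v_k * remaining)
        remaining *= 1.0 - v_k
    return weights, remaining


# Deterministic example: break off half of what is left each time.
weights, remaining = stick_breaking_weights([0.5, 0.5, 0.5])
# weights == [0.5, 0.25, 0.125], remaining == 0.125

# Random example with alpha = 1.5 and 50 sticks.
rng = random.Random(0)
v = [rng.betavariate(1.0, 1.5) for _ in range(50)]
w50, rest50 = stick_breaking_weights(v)
```

The weights and the leftover mass always sum to one, and the leftover shrinks geometrically in expectation as more sticks are broken, which is why a finite number of components can approximate the infinite mixture.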
Figure 3: Graphic representation of DP mixture in stick-breaking representation

The stick-breaking construction for the DP mixture is depicted in Figure 3.
With the model assumption, the joint probability of the data set $\mathcal{T}$, the component
indicators $Z$ and the model parameters $v$ and $\phi$ is:

$$p(\mathcal{T}, Z, v, \phi \mid \alpha, \beta, \gamma) = \prod_{\mu=1}^{N} \left[ p(X^\mu \mid z^\mu, \phi)\, p(z^\mu \mid v) \right] p(\phi \mid \beta, \gamma)\, p(v \mid \alpha) \tag{33}$$

Integrating over $v$ and $\phi$, summing over $Z$ and applying the logarithm, we obtain
the log-likelihood of the data set:

$$\ln p(\mathcal{T} \mid \alpha, \beta, \gamma) = \ln \int_{v} \int_{\phi} \sum_{Z} p(\mathcal{T}, Z, v, \phi \mid \alpha, \beta, \gamma)\, d\phi\, dv \tag{34}$$



Here the integral over $v$ is an integral over a vector $v \in [0, 1]^{\infty}$. The integral
over $\phi$ is an integral over an $\infty \times D$ vector $\phi \in [0, 1]^{\infty \times D}$. The sum over
$Z$ runs over all possible configurations of $Z$. This integral is intractable
because of the integral over infinitely many dimensions and the coupling of $Z$ and $v$.
Notice the following limit with a given truncation $K$:

$$\lim_{K \to \infty} \Big[ 1 - \sum_{k=1}^{K} \pi_k(v) \Big] = \lim_{K \to \infty} \prod_{k=1}^{K} (1 - v_k) = 0 \tag{35}$$

Equation (35) shows that for a large enough truncation level $K$, all the
components beyond the $K$th component can be ignored, as the sum of their
proportions is very close to 0. It is therefore possible to approximate the
infinite situation by a finite number of components. The difference from the finite
Bayesian model is that in the finite Bayesian mixture the number of components is
finite, whereas in the truncated DP mixture the number of components is infinite and we
only use a finite distribution to approximate it. Therefore we can use a finite
and fully decoupled function as the approximation of the true posterior distribution.
We propose the following factorized family of variational distributions:

$$q(Z, v, \phi \mid \tau, \rho_1, \rho_2, \eta, \nu) = \left[ \prod_{\mu=1}^{N} q(z^\mu \mid \tau^\mu) \right] \left[ \prod_{k=1}^{K} \prod_{i=1}^{D} q(\phi_{ik} \mid \eta_{ik}, \nu_{ik}) \right] \left[ \prod_{k=1}^{K-1} q(v_k \mid \rho_{1k}, \rho_{2k}) \right] \tag{36}$$

where
Algorithm 5 Variational EM for DP Bernoulli Mixtures
input parameters $\alpha$, $\beta$ and $\gamma$
input parameter $K$ as the truncated number of components
initialize $\tau$ to be a random assignment
repeat
    For all $i, k$ update $\rho_{1k}$, $\rho_{2k}$, $\eta_{ik}$, $\nu_{ik}$ by (37) to (40)
    for $\mu = 1$ to $N$ do
        for $k = 1$ to $K$ do
            Update $\tau_k^\mu$ according to (41)
        end for
        Normalize $\tau^\mu$ over $k$
    end for
until convergence



$$q(z^\mu \mid \tau^\mu) \sim \mathrm{Multinomial}(\tau^\mu)$$
$$q(\phi_{ik} \mid \eta_{ik}, \nu_{ik}) \sim \mathrm{Beta}(\eta_{ik}, \nu_{ik})$$
$$q(v_k \mid \rho_{1k}, \rho_{2k}) \sim \mathrm{Beta}(\rho_{1k}, \rho_{2k})$$

Here $\rho_1$, $\rho_2$, $\eta$ and $\nu$ are free variational parameters corresponding to the
hyperparameters 1, $\alpha$, $\beta$ and $\gamma$, and $\tau$ is the multinomial parameter for decoupling
$v$ and $Z$. As we are assuming that the proportion of the components beyond $K$ is 0,
the value of $v_K$ in the approximation is always 1. We use this $q(\cdot)$ function to
approximate the true posterior distribution of the parameters. To achieve this,
we need to estimate the values of $\rho_1$, $\rho_2$, $\eta$ and $\nu$. A detailed computation of
the optimization is given by [29]. The optimization yields the update
equations (37) to (41), which form an iterated optimization procedure. A brief
demonstration of this procedure is given by Algorithm 5.
Name           N        D     N_1s       Density
chess          3197     75    118252     49.32%
mushroom       8125     119   186852     19.33%
MS Web Data    37711    294   113845     1.03%
accidents      341084   468   11500870   7.22%

Table 1: General characteristics of the testing data sets: N is the number of records, D is
the number of items, N_1s is the number of "1"s and the Density reflects the sparseness of
the data set, which is calculated by Density = N_1s / (N D)



The predictive inference is given by Equation (42). As in the finite
model, we use the decoupled $q(\cdot)$ function to replace the true posterior
distribution so that we can do the integral analytically. In fact we only need to use the
value of $\frac{\rho_{1k}}{\rho_{1k} + \rho_{2k}} \prod_{k'=1}^{k-1} \frac{\rho_{2k'}}{\rho_{1k'} + \rho_{2k'}}$ as the proportion of each component. Thus the
number of parameters used for prediction is still the same as in the finite model if
we set the truncation level to the same value as the number of components
$K$ in the finite model.

$$p(I \mid \mathcal{T}) = \int_{v} \int_{\phi} \sum_{z} p(I \mid z, \phi)\, p(z \mid v)\, q(v, \phi \mid \rho_1, \rho_2, \eta, \nu)\, dv\, d\phi = \sum_{k=1}^{K} \frac{\rho_{1k}}{\rho_{1k} + \rho_{2k}} \prod_{k'=1}^{k-1} \frac{\rho_{2k'}}{\rho_{1k'} + \rho_{2k'}} \prod_{i_m \in I} \frac{\eta_{i_m k}}{\eta_{i_m k} + \nu_{i_m k}} \tag{42}$$



6. Empirical Results and Discussion

In this section, we compare the performance of the proposed models with the
non-Bayesian mixture model using 5 synthetic data sets and 4 real
benchmark data sets. We generate five synthetic data sets from five mixture
models with 15, 25, 50, 75 and 140 components respectively and apply the four
methods to the synthetic data sets to see how closely the new models compare
with the original mixture model. For the real data sets, we choose
mushroom, chess, Anonymous Microsoft Web data [30] and accidents [31]. The
mushroom and chess data sets we used were transformed to discrete binary data
sets by Roberto Bayardo and the transformed version can be downloaded at
http://fimi.ua.ac.be/data/. In Table 1 we summarize the main
characteristics of these 4 data sets. We randomly sampled a proportion of each data set
for training and used the rest for testing. For the synthetic data sets, mushroom
and chess, we sampled half of the data for training and used the rest for testing. For MS
Web data, the training and testing data sets were already prepared as 32711
records for training and 5000 records for testing. For accidents, as this data
set is too large to fit into memory, we sampled 1/20 as training data and
sampled another 1/20 for testing. We use the following 3 evaluation criteria for
model comparison.

1. We measure the difference between the predicted set of frequent itemsets
and the true set of frequent itemsets by calculating the false negative rate
($F^-$) and the false positive rate ($F^+$). They are calculated by

$$F^- = \frac{N_M}{N_M + N_C}, \qquad F^+ = \frac{N_F}{N_F + N_C}$$

where $N_M$ is the number of itemsets that the model failed to predict, $N_F$
is the number of itemsets that the model falsely predicted and $N_C$ is the
number of itemsets that the model predicted correctly. Note that $1 - F^-$
gives the recall and $1 - F^+$ gives the precision of the data-miner.
2. For any true frequent itemset $I$, we calculate the relative error by:



where $p_M(I)$ is the probability predicted by the model. The overall quality
of the estimation $E$ is:



where $N_I$ is the total number of true frequent itemsets.
3. To test whether the model is under-estimating or over-estimating, we
define the empirical mean of relative difference for a given set $S$ as:



The parameter settings of the experiments are as follows. As the aim of
applying the algorithms to the synthetic data sets is to see how closely the new
models compare with the original mixture model, we assume that we already
know the correct model before learning. Therefore for the synthetic data sets,
we set the number of components to that of the original mixture model,
except for the DP mixture via Gibbs sampling. For the real data sets, we used 15,
25, 50 and 75 components respectively for the finite Bayesian models and the
truncated DP mixture model. For the DP mixture model via the Gibbs sampler,
we do not need to set $K$. For each parameter configuration, we repeat the experiment 5 times to
reduce the variance. The hyperparameters for both finite and infinite Bayesian
models are set as follows: $\alpha$ equals 1.5, $\beta$ equals the frequency of the items in
the whole data set and $\gamma$ equals $1 - \beta$.

The last parameter for the experiments is the minimum frequency threshold.
As we mentioned, in practical situations there is no standard procedure to select
this threshold. However, in our experiments, as the goal is to test our models,
the threshold must make the itemsets frequent
enough to represent the correlation within the data sets, while still generating
enough frequent itemsets for model comparison. The threshold also
should not be too low, as a low threshold may make the experiments take too
much time. According to the characteristics of the data sets and several test runs,
we set the thresholds of the data sets as in Table 2 so that the numbers of the




            chess      mushroom   MS Web   accidents
threshold   50%        20%        0.5%     30%
total       1262028    53575      570      146904
1           37         42         79       32
2           530        369        214      406
3           3977       1453       181      2545
4           18360      3534       85       9234
5           57231      6261       11       21437
6           127351     8821                33645
7           209743     10171               36309
8           261451     9497                26582
9           249427     7012                12633
10          181832     4004                3566
>10         152089     2411                515

Table 2: Minimum frequency threshold and the number of frequent itemsets (in total and by itemset length)



frequent itemsets are suitable for model evaluation. The numbers of
frequent itemsets of different lengths are also listed. For the synthetic data sets, the
minimum support threshold is 30%.

The test results for the synthetic data sets are shown in Table 3, where $F^-$ is
the false negative rate in percent, $F^+$ is the false positive rate in percent
and $E$ is the empirical relative error in percent. We also calculate
the standard errors of these values. In the table, NBM, VFBM, GSFBM, VDPM
and GSDPM are short for non-Bayesian mixture, finite Bayesian mixture via
variational EM, finite Bayesian mixture via Gibbs sampler, DP mixture via
variational EM and DP mixture model via Gibbs sampler respectively. For the
number of components of the DP mixture via Gibbs sampler, we use the mean
of the number of components used in five trials.

From Table 3 we can observe that the average empirical errors of all four
Bayesian methods are below 2%, which means these methods can recover the
original model with a relatively small loss of accuracy. Comparing all the
methods, GSFBM fits the original model best. The non-Bayesian model gives the worst
overall estimation but the best false positive rate. The results of the other
approaches are generally comparable. With regard to specific data sets, in the tests
on Syn-15 and Syn-25, VDPM is slightly better than GSDPM, and GSDPM is
slightly better than VFBM. In the tests on Syn-50, the results of GSDPM and
VDPM are very close and both are slightly better than VFBM. In the tests on
Syn-75, the three methods give similar results. In the tests on Syn-140, GSDPM
outperforms the other two approaches whilst the rest are close.

Although the empirical relative errors of the four approaches are only about
1%-2%, the $F^-$ is much higher in comparison. This can be explained by the
distribution of the frequent itemsets over frequency. Figure 4 shows the distribution
of frequent itemsets with different frequencies for the synthetic data set Syn-15.
Dataset   Method   K       F-           F+          E
Syn-15    NBM      15      9.55±0.63    1.40±0.24   2.12±0.12
          VFBM     15      4.11±0.77    2.38±0.49   1.25±0.13
          GSFBM    15      3.19±0.18    2.93±0.04   1.17±0.02
          VDPM     15      3.54±0.25    2.69±0.48   1.18±0.05
          GSDPM    12.6    3.84±0.41    2.71±0.37   1.24±0.03
Syn-25    NBM      25      9.50±0.58    1.24±0.21   1.94±0.10
          VFBM     25      3.73±1.57    2.35±0.37   1.60±0.20
          GSFBM    25      2.63±0.74    2.84±0.28   0.95±0.07
          VDPM     25      3.46±0.70    2.48±0.48   1.06±0.13
          GSDPM    19      3.63±1.13    2.71±0.64   1.11±0.11
Syn-50    NBM      50      10.29±0.55   0.93±0.09   2.03±0.10
          VFBM     50      5.46±0.65    1.26±0.16   1.19±0.12
          GSFBM    50      3.16±0.32    1.60±0.11   0.85±0.03
          VDPM     50      5.14±0.57    1.23±0.20   1.13±0.07
          GSDPM    31      5.20±1.07    1.21±0.17   1.13±0.16
Syn-75    NBM      75      9.59±0.22    0.71±0.12   1.79±0.07
          VFBM     75      5.92±0.86    0.70±0.09   1.14±0.14
          GSFBM    75      4.34±0.60    0.81±0.07   0.89±0.09
          VDPM     75      6.04±0.54    0.67±0.08   1.14±0.08
          GSDPM    49.6    5.76±0.57    0.91±0.11   1.14±0.08
Syn-140   NBM      140     11.59±0.32   0.68±0.06   2.31±0.06
          VFBM     140     8.49±0.54    1.03±0.07   1.76±0.12
          GSFBM    140     5.43±0.30    1.27±0.16   1.22±0.01
          VDPM     140     8.66±0.40    1.14±0.21   1.80±0.04
          GSDPM    65      6.66±0.26    1.45±0.13   1.47±0.04

Table 3: Test results of the synthetic data sets (%), average of 5 runs



We use this data set as an example to demonstrate the sensitivity of the estimation
error on "edge" itemsets. The rest of the data sets have similar distributions.
From this figure, we can see that there are over 35,000 itemsets in the range
of 0.30-0.32, which means about one third of the frequent itemsets are on the
"edge" of the set of frequent itemsets. Assuming that a model under-estimates
each itemset by about 7%, an itemset with a frequency of 32% will be estimated
as 32% x (1 - 7%) = 29.76%, which is below the minimum frequency threshold,
and the itemset will be labelled as infrequent. This 7% under-estimation will
eventually cause a false negative rate of over 30%. The reason why the model
tends to under-estimate will be discussed later. This example is an extreme
circumstance. However, it is clear that with a pyramid-like distribution of the
frequent itemsets, the actual estimation error is amplified when we use the
$F^-$ and $F^+$ criteria.
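The amplification effect can be reproduced with a small numerical example. The Python sketch below is purely illustrative: the function name and the six frequencies are made up for the example, and the false negative and false positive rates are computed as $N_M / (N_M + N_C)$ and $N_F / (N_F + N_C)$.

```python
def false_rates(pairs, threshold):
    """Return (F-, F+) for a list of (true_frequency, predicted_frequency)."""
    n_m = sum(1 for t, p in pairs if t >= threshold and p < threshold)   # missed
    n_f = sum(1 for t, p in pairs if t < threshold and p >= threshold)   # false
    n_c = sum(1 for t, p in pairs if t >= threshold and p >= threshold)  # correct
    f_neg = n_m / (n_m + n_c) if n_m + n_c else 0.0
    f_pos = n_f / (n_f + n_c) if n_f + n_c else 0.0
    return f_neg, f_pos


# Hypothetical true frequencies, all above the 30% threshold,
# each under-estimated by 7% by the model.
true_freqs = [0.31, 0.32, 0.35, 0.40, 0.50, 0.60]
pairs = [(t, t * (1 - 0.07)) for t in true_freqs]
f_neg, f_pos = false_rates(pairs, threshold=0.30)
```

Here the two itemsets at 0.31 and 0.32 drop below the threshold, so a uniform 7% relative error already produces a false negative rate of one third.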

The aim of introducing synthetic datasets is to validate the optimization 
ability of the five approaches. We want to check whether the algorithms for 
[Figure: histogram titled "Synthetic, K=15"; x-axis: frequency bins from 0.30-0.32 up to >0.40; y-axis: number of frequent itemsets, up to 40000]

Figure 4: Distribution of the frequent itemsets over frequency of dataset Syn-15

mixture models can find the right parameters of the target model given the correct
$K$. The results show that the losses of the five approaches are acceptable when
the number of components is not too large. When the model gets more complicated,
the loss caused by the algorithm tends to increase. We tested the real data sets
with 15, 25, 50 and 75 components respectively. The test results
are shown in Table 4.

For 'chess', the Gibbs sampler used 29.6 components on average. Its result is
comparable to the rest of the algorithms with $K = 25$, but not as good as VDP
with $K = 50$ and $K = 75$. However, the improvement of VDP when raising the
truncation level from 25 to 75 is not very great, suggesting that the proper number
of components might be around 30. The false positive and false negative rates
look high, but the average estimation error is only around 3%. This is because
the frequencies of 18.04% of the frequent itemsets are just 2.5% higher than the
threshold, so a small under-estimation causes a large false negative rate.

The data set 'mushroom' is a well-known but unusual data set. There
are many itemsets whose frequencies are just above the minimum
threshold. Thus a small under-estimation might cause a large false negative rate.
The difference in relative error between non-Bayesian and Bayesian models is
about 6 percent. However, the difference in the false negative rate is large. When
training the VDP model, we find it very likely that the algorithm gets stuck in
some local minimum, which causes significant differences among the 5 trials.
In some trials, the $F^-$ is as low as about 1%, while in other trials the $F^-$
rises to about 25%. That is why the standard deviations of VDP at
truncation levels of 15 and 25 are larger than the averages. On the other hand, the
Gibbs sampler suggests that about 19 components are enough to approximate
the distribution of 'mushroom'. It gives better results than VDP at a truncation
level of 50.

In the experiments for 'accidents', GS uses 114.6 components on average,
far more than 50. Therefore it gives more accurate results than the other
algorithms. For 'MS Web', all three models seriously under-estimate the true
probabilities. Yet the Bayesian models work better than NBM. GS uses about
140 components to get a result better than the finite models. The phenomenon of
Chess
Method   K     F-           F+          E
NBM      15    22.47±1.24   ?           3.89±0.17
         25    19.71±1.30   ?           3.47±0.19
         50    17.22±0.98   ?           3.12±0.08
         75    15.01±1.05   ?           2.89±0.12
VFBM     15    ?            ?           3.03±0.10
         25    13.49±0.62   ?           2.77±0.08
         50    14.20±0.62   9.20±0.91   2.83±0.02
         75    13.95±0.91   9.4±0.46    2.81±0.09
GSFBM    15    18.85±1.94   ?           3.39±0.28
         25    15.63±1.82   8.25±1.05   2.97±0.17
         50    14.86±0.55   8.03±0.42   2.83±0.06
         75    17.4±0.?     6.6±0.39    3.07±0.06
VDPM     15    15.02±1.67   9.56±1.87   3.17±0.13
         25    14.31±0.?5   ?           2.83±0.05
         50    13.49±0.73   ?           2.78±0.10
         75    14.43±1.08   ?           2.77±0.11
GSDPM    ?     14.46±0.72   8.?0±0.89   2.83±0.05

(? marks a cell or digit that is illegible or missing in the source.)

Table 4: The empirical results in percentage (%) of the four data sets (mean±std)



under-estimation on 'MS Web' will be discussed later.

Comparing the variational algorithm and Gibbs sampling, a big advantage
of Gibbs sampling is that it is nonparametric, which means the problem of
choosing the number of components can be left to the algorithm itself. Facing
an unknown data set, choosing an appropriate $K$ is difficult, and one has to try
several times to determine it. The tests on the four test cases showed
that Gibbs sampling can find a proper number. The idea of choosing
$K$ automatically is simply the following: create a new cluster if no existing
cluster can fit the current data point significantly better than the average of the
whole population. The DP mixture via the Gibbs sampler implements this idea
in a stochastic way. Another advantage is the accuracy of the Gibbs sampler,
as we do not need to make the truncation and decoupling approximations of the
variational method. However, a serious limitation of the Gibbs sampler is its speed.
As the Gibbs sampler generates a different number of components each time, and
because the convergence conditions of the methods differ, we cannot compare
the time cost of the two methods in a perfectly fair manner.

However, to illustrate the differences in time costs, we show a time cost
analysis for the data set Accidents in Table 5. The training time costs per iteration
of NBM, VFBM and VDPM do not increase when $K$ increases. The reason
might be the optimized vector computation in Matlab, which makes
the time cost less sensitive to increases in $K$. The GSFBM and GSDPM involve sampling
Accidents
Method   K       Iterations   T_off    T_off/I
NBM      15      15.4         132.6    8.61
         25      16.8         133.4    7.94
         50      15.6         147.4    9.45
         75      14.8         145.6    9.84
VFBM     15      18.4         185.0    10.06
         25      18.2         187.4    12.86
         50      18.0         195.4    10.86
         75      16.4         190.4    11.57
GSFBM    15      10.0         261.8    26.18
         25      10.0         276.2    27.62
         50      10.2         344.4    33.75
         75      10.2         408.0    40.00
VDPM     15      21.0         211.2    10.06
         25      16.2         187.2    12.86
         50      17.0         184.6    10.86
         75      16.4         189.8    11.57
GSDPM    114.6   130.2        6812.2   51.64

Table 5: Total training time cost and training time per iteration of dataset Accidents (sec),
average of 5 runs. T_off is the time used for model training; T_off/I is the training time per
iteration

a multinomial distribution, which cannot be handled as a vector operation in
Matlab. Therefore, their training time cost per iteration still depends on $K$.
Generally, NBM is the fastest, variational methods are a bit slower and
sampling methods are the slowest. Although the training time cost of all methods is
$O(NDK)$, NBM does not involve any complex function evaluation. Variational
methods need to calculate functions such as the logarithm, the exponential and
the digamma function. The sampling methods need to calculate logarithm and
exponential functions and to generate random numbers. A more time-consuming
aspect of sampling is that it needs to update the parameters after each draw.
However, we notice that the number of iterations used by GSFBM is smaller than
that of the variational methods. This is because the GSFBM model is updated
after each draw, so it can be viewed as an online updating model. On the other
hand, the variational methods are both updated in batch mode, which normally
takes more iterations to converge. The prediction time cost is simpler to analyse
than the training cost. If the numbers of components of the different models
are the same, the prediction time cost should be the same.

Generally, the Bayesian models take more time than the non-Bayesian model
for training. Among the Bayesian models, the variational methods are faster
than the two sampling methods. The DP mixture model via Gibbs sampling
is the slowest; however, the benefit of this approach is that it does not require
multiple runs to find the proper $K$.
           T_on
Eclat      76.31
K=15       0.37
K=25       0.46
K=50       0.66
K=75       0.79
K=114.6    0.94

Table 6: Itemset generating time cost of dataset Accidents (sec), average of 5 runs. T_on is
the generating time

With the model prepared, the process of generating frequent itemsets from the
model is much faster than Eclat data mining. In most cases the itemset
generation process is over 10 times faster than Eclat mining. As the cost of querying
the model does not depend on the scale of the original data set or on the minimum frequency
threshold, the probability models save more time when we deal with large
data sets or need to mine the data set multiple times with different thresholds.

From the experimental results, we have made several interesting observations
about mixture models for frequent itemset discovery. Firstly, as the false
negative rates are always much higher than the false positive rates, we observe that
the mixture models tend to under-estimate the probabilities of the frequent
itemsets. To clarify this, we calculate the empirical errors of the frequent itemsets
in a more detailed way. We first classify the frequent itemsets into different
categories by their lengths. Then we calculate the means of relative difference
of all categories. We show the analysis of each data set with 50 components and
the Gibbs sampling results in Figure 5.

From Figure 5 we can see a clear trend: the longer the
frequent itemsets are, the more their probabilities are under-estimated. The
models differ in the degree of under-estimation. Similar
to the result shown in Table 4, all Bayesian
models under-estimate less than the non-Bayesian mixture.

Another observation is the significant difference in the models' performance
between MS Web and the other three data sets. The under-estimation in MS
Web is much more serious than in the rest. Checking Table 1, we notice that MS
Web is much sparser than the other three. A further background investigation
of the four data sets shows that the difference may be caused by the fact
that the correlations between items within the three dense data sets are much
stronger than in MS Web, which is sparse. Therefore the distribution of these
dense data records can be better approximated by a mixture model structure. More
improvements to the mixture model may be required to achieve a satisfactory
performance on sparse data sets.

We think the reason for the under-estimation is that the mixture model is
a mixture of independent models. In an independent Bernoulli model, the
probability of a pattern is simply the product of the item parameters, which
under-estimates combinations of positively correlated items. In mixture
models, this is compensated for by the assumption of conditional
independence: the correlations in the data set are captured by the different
components and described by their separate groups of conditional
probabilities. In strongly correlated data sets such as classification data,
the feature attributes of each class show high dependencies. These
dependencies are strong and simple because the correlated attributes are
clustered by the latent classes. Under these circumstances most correlations
are represented by the model, so the under-estimation is tolerable. However,
for non-classification data sets, where the correlations are weaker, looser
and more chaotic, the mixture model cannot capture all the complexity with
a feasible number of components. We think this explains why there is severe
under-estimation on the MS Web data for all three models.
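
A tiny worked example may make the argument concrete. The four toy
transactions and the hand-picked mixture parameters below are hypothetical;
they only illustrate how a product of marginals under-estimates a correlated
pair while a mixture of independent components can recover it.

```python
# Four toy transactions over two items; the two items always co-occur.
transactions = [(1, 1), (1, 1), (0, 0), (0, 0)]
n = len(transactions)

# True frequency of the itemset {item 0, item 1}.
true_freq = sum(1 for t in transactions if t[0] == 1 and t[1] == 1) / n

# Independent Bernoulli model: product of the two item marginals.
marginals = [sum(t[i] for t in transactions) / n for i in range(2)]
independent_prob = marginals[0] * marginals[1]

# Two-component mixture with one component per latent group.
# phi[k][i] = P(item i is present | component k); values chosen by hand.
weights = [0.5, 0.5]
phi = [[1.0, 1.0], [0.0, 0.0]]
mixture_prob = sum(w * p[0] * p[1] for w, p in zip(weights, phi))

print(true_freq, independent_prob, mixture_prob)  # 0.5 0.25 0.5
```

When the correlations do not fall into a few clean groups like this, more
components are needed to close the gap, which matches the behaviour
described above for sparse data.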

In general, compared with classic frequent itemset mining, a well-trained
probability model has the following benefits. First, the mixture model
describes the correlations in the data set and helps people understand it,
whereas a set of frequent itemsets is merely a collection of facts that
still need to be interpreted. A probability model can answer all kinds of
probability queries, such as joint, marginal and conditional probabilities,
whereas frequent itemset mining and association rule mining focus only on
high marginal and conditional probabilities. Furthermore, interesting
dependencies between items, both positive and negative, are easier to
observe from the model's parameters than to pick out from the whole set of
frequent itemsets or association rules. Second, once the model is trained,
generating a set of frequent itemsets from it is faster than mining the data
set. We use 'chess' as an example because its full set of frequent itemsets
contains 1262028 itemsets, so the mining time is long enough for a
comparison. The average mining time with Eclat is about 25 seconds, while
generation from the well-trained mixture model takes less than 10 seconds.
Within the same search framework, frequent itemset mining obtains each
frequency by scanning the database, or by keeping the data set cached in
memory, and counting, whereas the mixture model obtains each probability
from a number of multiplications and a single summation. Finally, the model
can serve as a proxy for the entire data set, as it is normally much smaller
than the original data.
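
The kind of query described above can be sketched as follows. This is a
minimal illustration under the assumption that the trained model is stored
as component weights and per-component Bernoulli parameters; the function
names and the parameter values are hypothetical.

```python
from math import prod

def itemset_probability(itemset, weights, phi):
    """P(all items in `itemset` are present) under a Bernoulli mixture.

    Items outside `itemset` are marginalized out, which for independent
    Bernoulli components simply drops their factors.
    """
    return sum(w * prod(phi_k[i] for i in itemset)
               for w, phi_k in zip(weights, phi))

def conditional_probability(target, given, weights, phi):
    """P(target present | given present) as a ratio of two joint queries."""
    joint = itemset_probability(set(target) | set(given), weights, phi)
    return joint / itemset_probability(given, weights, phi)

# Hypothetical trained parameters: 2 components, 3 items.
weights = [0.6, 0.4]
phi = [[0.9, 0.8, 0.1],
       [0.2, 0.1, 0.7]]

p_01 = itemset_probability({0, 1}, weights, phi)               # about 0.44
p_1_given_0 = conditional_probability({1}, {0}, weights, phi)  # about 0.71
```

Each query costs one product per component and one sum over components, so
its cost does not depend on the number of transactions in the original data.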

7. Conclusion 

In this paper, we applied finite and infinite Bayesian mixture models,
each inferred by two methods, to the frequent itemset estimation problem.
Compared with earlier non-Bayesian models, a Bayesian mixture model can
improve estimation accuracy without introducing extra model complexity. The
DP mixture model with the Gibbs sampler can reach even better accuracy, with
a suitable number of components generated automatically. We tested the
Bayesian and non-Bayesian mixture models on 5 synthetic data sets and 4
benchmark data sets. The experiments showed that in all cases the DP models
outperformed the non-Bayesian model.

Figure 5: Relative difference between the models and true frequencies

Experiments also showed that all mixture models tend to under-estimate the
probabilities of frequent itemsets. The average degree of under-estimation
increases with the length of the frequent itemsets. For sparse data sets,
all mixture models perform poorly because of the weak correlation between
items. One possible avenue for further work is therefore the use of
probability models that explicitly represent sparsity. We also observe that
performance improves as the number of components increases, suggesting some
degree of underfitting. Throughout this work we assumed that the data are a
mixture of transactions generated by independent models. An alternative
approach would be to assume that each transaction is itself a mixture [32],
or to model the indicator distribution as a mixture [33]. This might fit the
data better.



8. Appendix 

Here we briefly review the EM algorithm for the non-Bayesian mixture
model. Taking the logarithm over all transactions in the data set, Equation
(2) becomes the log-likelihood of the model:



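\[
\ln \mathcal{L}(\Theta \mid T) = \ln p(T \mid \Theta)
  = \sum_{\mu=1}^{N} \ln \sum_{k=1}^{K} \pi_k \prod_{i=1}^{D}
    \phi_{ik}^{x_i^{\mu}} (1 - \phi_{ik})^{1 - x_i^{\mu}}
\tag{45}
\]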
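
As a rough illustration, the log-likelihood of Equation (45) can be
evaluated as below. The sketch assumes all weights are positive and all
Bernoulli parameters lie strictly between 0 and 1. It uses the log-sum-exp
trick for numerical stability, which is an implementation choice and not
part of the derivation. The function name is hypothetical.

```python
import math

def log_likelihood(data, weights, phi):
    """ln p(T | Theta) for a Bernoulli mixture.

    data    : list of binary tuples, one per transaction
    weights : mixing weights pi_k (assumed > 0)
    phi     : phi[k][i] = P(item i = 1 | component k), assumed in (0, 1)
    """
    total = 0.0
    for x in data:
        # ln pi_k + sum_i ln Bernoulli(x_i | phi_ik), for each component k
        logs = []
        for w, phi_k in zip(weights, phi):
            s = math.log(w)
            for x_i, p in zip(x, phi_k):
                s += math.log(p) if x_i else math.log(1.0 - p)
            logs.append(s)
        # log-sum-exp over the components avoids underflow
        m = max(logs)
        total += m + math.log(sum(math.exp(v - m) for v in logs))
    return total

# With one component and phi = 0.5, every 2-item transaction has ln 0.25.
example_ll = log_likelihood([(1, 0), (0, 1), (1, 1)], [1.0], [[0.5, 0.5]])
```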



However, the log-likelihood is hard to optimize because it contains the
logarithm of a sum. The trick is to treat $Z$ as a random variable. For any
distribution $q(Z)$, the following equation holds:

\[
\ln p(T \mid \Theta) = \sum_{Z} q(Z) \ln p(T \mid \Theta)
  = L(q, \Theta) + KL(q \| p),
\tag{46}
\]

where $\sum_{Z} \equiv \sum_{z^{1}=1}^{K} \cdots \sum_{z^{\mu}=1}^{K} \cdots
\sum_{z^{N}=1}^{K}$ and

\[
L(q, \Theta) = \sum_{Z} q(Z) \ln \frac{p(T, Z \mid \Theta)}{q(Z)}
\tag{47}
\]

\[
KL(q \| p) = -\sum_{Z} q(Z) \ln \frac{p(Z \mid T, \Theta)}{q(Z)}
\tag{48}
\]

In Equation (46), $KL(q \| p)$ is the Kullback-Leibler divergence (KL
divergence) between $q(Z)$ and the true posterior distribution
$p(Z \mid T, \Theta)$. Recall that for any distribution the KL divergence
satisfies $KL(q \| p) \geq 0$, with equality if and only if
$q(Z) = p(Z \mid T, \Theta)$. Therefore, based on Equation (46), we have
$\ln p(T \mid \Theta) \geq L(q, \Theta)$. Thus $L(q, \Theta)$ can be regarded
as a lower bound of the log-likelihood, and we can maximize the likelihood
by maximizing $L(q, \Theta)$. For $q(Z)$ we assume a multinomial form:

\[
q(Z) = \prod_{\mu=1}^{N} q(z^{\mu}), \quad \text{where } q(z^{\mu}) \sim
\mathrm{Multinomial}(\tau^{\mu})
\tag{49}
\]

Thus, we can expand $L(q, \Theta)$:

\[
L(q, \Theta) = \sum_{Z} q(Z) \ln p(T \mid Z, \Theta)
  + \sum_{Z} q(Z) \ln p(Z \mid \Theta)
  - \sum_{Z} q(Z) \ln q(Z)
\tag{50}
\]

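All the terms involve standard computations in the exponential family, and
the parameters can be optimized by a classic constrained maximization of a
multivariate function:

\[
\tau_k^{\mu} = \frac{\pi_k \prod_{i=1}^{D} \phi_{ik}^{x_i^{\mu}}
  (1 - \phi_{ik})^{1 - x_i^{\mu}}}
  {\sum_{k'=1}^{K} \pi_{k'} \prod_{i=1}^{D} \phi_{ik'}^{x_i^{\mu}}
  (1 - \phi_{ik'})^{1 - x_i^{\mu}}}
\tag{51}
\]

\[
\pi_k = \frac{1}{N} \sum_{\mu=1}^{N} \tau_k^{\mu}
\tag{52}
\]

\[
\phi_{ik} = \frac{\sum_{\mu=1}^{N} \tau_k^{\mu} x_i^{\mu}}
  {\sum_{\mu=1}^{N} \tau_k^{\mu}}
\tag{53}
\]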
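
As an illustration of the constrained maximization, the update for the
mixing weights can be derived with a Lagrange multiplier. The multiplier
$\lambda$ is introduced here only for this sketch, and $\tau_k^{\mu}$
denotes $q(z^{\mu} = k)$. Only the second term of the lower bound depends on
$\pi_k$, and the weights must sum to one:

\[
\begin{aligned}
&\frac{\partial}{\partial \pi_k} \left[ \sum_{\mu=1}^{N} \sum_{k'=1}^{K}
  \tau_{k'}^{\mu} \ln \pi_{k'}
  + \lambda \Big( \sum_{k'=1}^{K} \pi_{k'} - 1 \Big) \right]
  = \frac{1}{\pi_k} \sum_{\mu=1}^{N} \tau_k^{\mu} + \lambda = 0 \\
&\Rightarrow \quad \pi_k = -\frac{1}{\lambda} \sum_{\mu=1}^{N} \tau_k^{\mu},
  \qquad
  \sum_{k=1}^{K} \pi_k = 1 \;\Rightarrow\; \lambda = -N
  \;\Rightarrow\; \pi_k = \frac{1}{N} \sum_{\mu=1}^{N} \tau_k^{\mu}.
\end{aligned}
\]

The second implication uses $\sum_{k} \tau_k^{\mu} = 1$ for every
transaction. The update for $\phi_{ik}$ follows in the same way from the
first term of the lower bound, without a constraint.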



Equations (52) and (53) depend on $\tau_k^{\mu}$, and Equation (51) depends
on $\pi_k$ and $\phi_{ik}$, so the optimization alternates between two
phases. After the model is initialized, we first compute $\tau_k^{\mu}$
according to Equation (51). This step is called the E-step
(Expectation-step). In this step $q(Z)$ is set equal to
$p(Z \mid T, \Theta^{old})$, which makes the Kullback-Leibler divergence
$KL(q \| p)$ vanish and raises the lower bound $L(q, \Theta^{old})$ to the
value of the log-likelihood function $\ln p(T \mid \Theta^{old})$. Then we
compute $\pi_k$ and $\phi_{ik}$ according to Equations (52) and (53). This
step is called the M-step (Maximization-step). In this step $q(Z)$ is held
fixed and the lower bound is maximized by changing $\Theta^{old}$ to
$\Theta^{new}$. Because the KL divergence is always non-negative, the
log-likelihood function $\ln p(T \mid \Theta)$ increases at least as much as
the lower bound does. The EM algorithm iterates the two steps until
convergence. A more detailed introduction to the EM algorithm is given in
[21].
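
The alternation between the two steps can be sketched in plain Python. This
is a minimal illustration for a Bernoulli mixture, not the implementation
used in the experiments. The function names, the toy data, the fixed
initialization and the clipping constant `eps` (which keeps the parameters
away from 0 and 1) are all assumptions made for the example.

```python
import math

def e_step(data, weights, phi):
    """E-step: responsibilities tau[mu][k] and the current ln p(T | Theta)."""
    tau, log_lik = [], 0.0
    for x in data:
        joint = []
        for w, phi_k in zip(weights, phi):
            p = w
            for x_i, q in zip(x, phi_k):
                p *= q if x_i else 1.0 - q
            joint.append(p)
        norm = sum(joint)
        tau.append([j / norm for j in joint])
        log_lik += math.log(norm)
    return tau, log_lik

def m_step(data, tau, eps=1e-6):
    """M-step: re-estimate the weights pi_k and the parameters phi_ik."""
    n, d, k = len(data), len(data[0]), len(tau[0])
    weights, phi = [], []
    for c in range(k):
        n_c = sum(t[c] for t in tau)  # effective size of component c
        weights.append(n_c / n)
        row = []
        for i in range(d):
            num = sum(t[c] * x[i] for t, x in zip(tau, data))
            # Clipping is an added guard, not part of the update equations.
            row.append(min(max(num / n_c, eps), 1.0 - eps))
        phi.append(row)
    return weights, phi

def fit(data, weights, phi, iterations=20):
    """Alternate E- and M-steps; return the model and log-likelihood trace."""
    history = []
    for _ in range(iterations):
        tau, log_lik = e_step(data, weights, phi)
        history.append(log_lik)
        weights, phi = m_step(data, tau)
    tau, log_lik = e_step(data, weights, phi)
    history.append(log_lik)
    return weights, phi, tau, history

# Toy data with two obvious groups, and a fixed, slightly biased start.
data = [(1, 1, 0, 0), (1, 1, 0, 0), (1, 0, 0, 0),
        (0, 0, 1, 1), (0, 0, 1, 1), (0, 1, 1, 1)]
init_weights = [0.5, 0.5]
init_phi = [[0.6, 0.6, 0.4, 0.4], [0.4, 0.4, 0.6, 0.6]]
weights, phi, tau, history = fit(data, init_weights, init_phi)
```

A fixed number of iterations is used here for simplicity; in practice one
would stop when the change in the log-likelihood falls below a tolerance.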

References 

[1] R. Agrawal, T. Imieliński, A. Swami, Mining association rules between sets
of items in large databases, SIGMOD Rec. 22 (1993) 207-216. 

[2] N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, Discovering frequent closed 
itemsets for association rules, in: C. Beeri, P. Buneman (Eds.), Database 
Theory ICDT99, volume 1540 of Lecture Notes in Computer Science, 
Springer Berlin / Heidelberg, 1999, pp. 398-416. 

[3] T. Calders, B. Goethals, Mining all non-derivable frequent itemsets, in:
T. Elomaa, H. Mannila, H. Toivonen (Eds.), Principles of Data Mining and 
Knowledge Discovery, volume 2431 of Lecture Notes in Computer Science, 
Springer Berlin / Heidelberg, 2002, pp. 1-42. 10.1007/3-540-45681-3-7. 

[4] IBM, IBM Intelligent Miner User's Guide, version 1, release 1, 1996.

[5] S. Brin, R. Motwani, C. Silverstein, Beyond market baskets: generalizing 
association rules to correlations, SIGMOD Rec. 26 (1997) 265-276. 

[6] S. Jaroszewicz, Interestingness of frequent itemsets using Bayesian networks
as background knowledge, in: Proceedings of the SIGKDD Conference
on Knowledge Discovery and Data Mining, ACM Press, 2004, pp. 178-186. 

[7] N. Tatti, Maximum entropy based significance of itemsets, Knowl. Inf. 
Syst. 17 (2008) 57-77. 

[8] C. Chow, C. Liu, Approximating discrete probability distributions with 
dependence trees, Information Theory, IEEE Transactions on 14 (1968) 
462 - 467. 

[9] J. B. Kruskal, On the shortest spanning subtree of a graph and the
traveling salesman problem, Proceedings of the American Mathematical 
Society 7 (1956) pp. 48-50. 

[10] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plau- 
sible Inference, Morgan Kaufmann Publishers Inc., 1988. 

[11] D. Pavlov, H. Mannila, P. Smyth, Beyond independence: Probabilistic 
models for query approximation on binary transaction data, IEEE Trans- 
actions on Knowledge and Data Engineering 15 (2003) 1409-1421. 






[12] N. Tatti, M. Mampaey, Using background knowledge to rank itemsets, 
Data Min. Knowl. Discov. 21 (2010) 293-309. 

[13] B. S. Everitt, D. J. Hand, Finite Mixture Distributions, Chapman and
Hall, London; New York, 1981.

[14] T. S. Ferguson, A Bayesian analysis of some nonparametric problems, The
Annals of Statistics 1 (1973) pp. 209-230. 

[15] M. J. Wainwright, M. I. Jordan, Graphical models, exponential families, 
and variational inference, Technical Report, Dept. of Statistics, 2003. 

[16] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions and the 
Bayesian restoration of images, IEEE Transactions on Pattern Analysis 
and Machine Intelligence 6 (1984) 721-741. 

[17] N. Metropolis, S. Ulam, The Monte Carlo method, Journal of the American
Statistical Association 44 (1949) pp. 335-341. 

[18] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, E. Teller, 
Equation of State Calculations by Fast Computing Machines, The Journal 
of Chemical Physics 21 (1953) 1087-1092. 

[19] W. K. Hastings, Monte Carlo sampling methods using Markov chains and 
their applications, Biometrika 57 (1970) 97-109. 

[20] A. P. Dempster, N. M. Laird, D. B. Rubin, Maximum likelihood from 
incomplete data via the EM algorithm, Journal of the Royal Statistical
Society. Series B (Methodological) 39 (1977) pp. 1-38. 

[21] C. M. Bishop, Pattern Recognition and Machine Learning (Information 
Science and Statistics), Springer, 1st ed. 2006. corr. 2nd printing edition, 
2007. 

[22] G. Schwarz, Estimating the dimension of a model, The Annals of Statistics 
6 (1978) 461-464. 

[23] R. Agrawal, R. Srikant, Fast algorithms for mining association rules in 
large databases, in: Proceedings of the 20th International Conference on 
Very Large Data Bases, VLDB '94, Morgan Kaufmann Publishers Inc., San 
Francisco, CA, USA, 1994, pp. 487-499. 

[24] M. Zaki, Scalable algorithms for association mining, Knowledge and Data 
Engineering, IEEE Transactions on 12 (2000) 372 -390. 

[25] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate gener- 
ation, in: Proceedings of the 2000 ACM SIGMOD international conference 
on Management of data, SIGMOD '00, ACM, New York, NY, USA, 2000, 
pp. 1-12. 






[26] M. J. Beal, Variational Algorithms for Approximate Bayesian Inference, 
Ph.D. thesis, University of London, 2003. 



[27] D. Blackwell, J. B. Macqueen, Ferguson distributions via Polya urn 
schemes, The Annals of Statistics 1 (1973) 353-355. 

[28] J. Sethuraman, A constructive definition of Dirichlet priors, Statistica 
Sinica 4 (1994) 639-650. 

[29] D. M. Blei, M. I. Jordan, Variational inference for Dirichlet process
mixtures, Bayesian Analysis 1 (2005) 121-144.

[30] A. Frank, A. Asuncion, UCI machine learning repository,
http://archive.ics.uci.edu/ml, 2010.

[31] C. Geurts, G. Wets, T. Brijs, K. Vanhoof, Profiling high frequency acci- 
dent locations using association rules, in: proceedings of the 82nd Annual 
Transportation Research Board, Washington DC. (USA), January 12-16, 
p. 18pp. 

[32] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet allocation, J. Mach.
Learn. Res. 3 (2003) 993-1022. 

[33] Y. W. Teh, M. I. Jordan, M. J. Beal, D. M. Blei, Hierarchical Dirichlet
processes, Journal of the American Statistical Association 101 (2004).






